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Preface 



This book is aimed at senior undergraduates and graduate students in Engi- 
neering, Science, Mathematics, and Computing. It expects familiarity with 
calculus, probability theory, and linear algebra as taught in a first- or second- 
year undergraduate course on mathematics for scientists and engineers. 

Conventional courses on information theory cover not only the beauti- 
ful theoretical ideas of Shannon, but also practical solutions to communica- 
tion problems. This book goes further, bringing in Bayesian data modelling, 
Monte Carlo methods, variational methods, clustering algorithms, and neural 
networks. 

Why unify information theory and machine learning? Because they are 
two sides of the same coin. In the 1960s, a single field, cybernetics, was 
populated by information theorists, computer scientists, and neuroscientists, 
all studying common problems. Information theory and machine learning still 
belong together. Brains are the ultimate compression and communication 
systems. And the state-of-the-art algorithms for both data compression and 
error-correcting codes use the same tools as machine learning. 

How to use this book 

The essential dependencies between chapters are indicated in the figure on the 
next page. An arrow from one chapter to another indicates that the second 
chapter requires some of the first. 

Within Parts I, II, IV, and V of this book, chapters on advanced or optional 
topics are towards the end. All chapters of Part III are optional on a first 
reading, except perhaps for Chapter 16 (Message Passing). 

The same system sometimes applies within a chapter: the final sections of- 
ten deal with advanced topics that can be skipped on a first reading. For exam- 
ple in two key chapters - Chapter 4 (The Source Coding Theorem) and Chap- 
ter 10 (The Noisy-Channel Coding Theorem) - the first-time reader should 
detour at section 4.5 and section 10.4 respectively 

Pages vii-x show a few ways to use this book. First, I give the roadmap for 
a course that I teach in Cambridge: 'Information theory, pattern recognition, 
and neural networks'. The book is also intended as a textbook for traditional 
courses in information theory. The second roadmap shows the chapters for an 
introductory information theory course and the third for a course aimed at an 
understanding of state-of-the-art error-correcting codes. The fourth roadmap 
shows how to use the text in a conventional course on machine learning. 



v 
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Preface xi 
About the exercises 

You can understand a subject only by creating it for yourself. The exercises 
play an essential role in this book. For guidance, each has a rating (similar to 
that used by Knuth (1968)) from 1 to 5 to indicate its difficulty 

In addition, exercises that are especially recommended are marked by a 
marginal encouraging rat. Some exercises that require the use of a computer 
are marked with a C. 

Answers to many exercises are provided. Use them wisely. Where a solu- 
tion is provided, this is indicated by including its page number alongside the 
difficulty rating. 

Solutions to many of the other exercises will be supplied to instructors 
using this book in their teaching; please email solutions@cambridge.org. 



Summary 


of codes for exercises 








Especially recommended 


[1] 


Simple (one minute) 






[2] 


Medium (quarter hour) 


> 


Recommended 


[3] 


Moderately hard 


C 


Parts require a computer 


[4] 


Hard 


[p. 42] 


Solution provided on page 42 


[5] 


Research project 



Internet resources 

The website 

http : //www . inf erence . phy . cam . ac . uk/mackay/ it ila 

contains several resources: 

1. Software. Teaching software that I use in lectures, interactive software, 
and research software, written in perl, octave, tcl, C, and gnuplot. 
Also some animations. 

2. Corrections to the book. Thank you in advance for emailing these! 

3. This book. The book is provided in postscript, pdf , and djvu formats 
for on-screen viewing. The same copyright restrictions apply as to a 
normal book. 

About this edition 

This is the third printing of the first edition. In the second printing, the 
design of the book was altered slightly. Page-numbering generally remains 
unchanged, except in chapters 1, 6, and 28, where a few paragraphs, figures, 
and equations have moved around. All equation, section, and exercise numbers 
are unchanged. In the third printing, chapter 8 has been renamed 'Dependent 
Random Variables', instead of 'Correlated', which was sloppy. 
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About Chapter 1 



In the first chapter, you will need to be familiar with the binomial distribution. 
And to solve the exercises in the text - which I urge you to do - you will need 
to know Stirling's approximation for the factorial function, x\ ~ x x e~ x , and 
be able to apply it to (^f) = ^y^Tr!" These topics are reviewed below. 

The binomial distribution 

Example 1.1. A bent coin has probability / of coming up heads. The coin is 
tossed N times. What is the probability distribution of the number of 
heads, r? What are the mean and variance of r? 



Unfamiliar notation? 
See Appendix A, p. 598. 



Solution. The number of heads has a binomial distribution. 



P(r\f,N) 



N 



f(l-f) 



N-r 



(1.1) 



The mean, £[r], and variance, var[r], of this distribution are defined by 



N 



£[r] = Y,P{r\f,N); 



r=0 



0.3 - 

0.25 - 

0.2 - 

0.15 - 

0.1 - 

0.05 - 

- 



0123456789 10 

r 



Figure 1.1. The binomial 
( L2 ) distribution P(r \ f = 0.3, N = 10). 



var[r] = £ [(r - £[r]) 2 ] (1.3) 

N 

= £[r 2 ]-(£[r]) 2 = Y,P(r\f,N)r 2 -(£[r}) 2 . (1.4) 



r=0 



Rather than evaluating the sums over r in (1.2) and (1.4) directly, it is easiest 
to obtain the mean and variance by noting that r is the sum of N independent 
random variables, namely, the number of heads in the first toss (which is either 
zero or one), the number of heads in the second toss, and so forth. In general, 



£[x + y] = £[x] + £[y] for any random variables x and y; 
var[x + y] = var[x] + var[y] if x and y are independent. 



(1.5) 



So the mean of r is the sum of the means of those random variables, and the 
variance of r is the sum of their variances. The mean number of heads in a 
single toss is / x 1 + (1 — /) x = /, and the variance of the number of heads 
in a single toss is 

[/ x I 2 + (1 - /) x 2 ] - f = f - f = f(l - /), (1.6) 

so the mean and variance of r are: 

£[r]=Nf and var[r] = Nf(l - /). □ (1.7) 



1 
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About Chapter 1 



Approximating x\ and (^) 



Let's derive Stirling's approximation by an unconventional route. We start 
from the Poisson distribution with mean A, 



P(r|A) = e- A ^ re {0,1,2,...}. 



(1.8) 



0.1 
0.08 
0.06 - 
0.04 - 
0.02 




For large A, this distribution is well approximated - at least in the vicinity of 
r ~ A - by a Gaussian distribution with mean A and variance A: 



5 10 15 20 25 

r 



—A 



Let's plug r = A into this formula. 



1 ( r - x r 
2A 



—A 



A 



A_ 
A! 



^>A! ~ X x e- x V2^X. 



This is Stirling's approximation for the factorial function. 

^2ixx In x\ ~ x In x — x + ^ In 2nx. 



(1.9) 

(1.10) 
(1.11) 

(1.12) 



Figure 1.2. The Poisson 
distribution P(r | A= 15). 



We have derived not only the leading order behaviour, x\ c± x x e x , but also, 



at no cost, the next-order correction term \ / 2nx. We now apply Stirling's 
approximation to In (^f) : 



In 



In 



AH 



(N-r)\r\ 



N N 
(N-r)ln— + rln— . (1.13) 

v N - r r v ' 



Since all the terms in this equation are logarithms, this result can be rewritten 
in any base. We will denote natural logarithms (log e ) by 'In', and logarithms Recall that log 2 a; 
to base 2 (log 2 ) by 'log'. 

If we introduce the binary entropy function, Note that 



log e x 
d log 2 x 11 



dx 



log, 2 x ' 



H 2 (x) = a; log - + (l-a;)log 1 . 

x (1 — x) 

then we can rewrite the approximation (1.13) as 



log 



or, equivalently, 



NH 2 {r/N), 



^ 2 NH 2(r/N) 



(1.14) 



(1.15) 



(1.16) 



ff 2 (a;) 




0.2 0.4 



If we need a more accurate approximation, we can include terms of the next Figure 1.3. The binary entropy 
order from Stirling's approximation (1.12): function. 



log 



NH 2 (r/N) - ilog 



2ttN • 



N — r r 

IsTn 



(1.17) 
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Introduction to Information Theory 



The fundamental problem of is that of reproducing at one point ei- 
ther exactly or approximately a message selected at another point. 

(Claude Shannon, 1948) 

In the first half of this book we study how to measure information content; we 
learn how to compress data; and we learn how to communicate perfectly over 
imperfect communication channels. 

We start by getting a feeling for this last problem. 

► 1.1 How can we achieve perfect communication over an imperfect, 
noisy communication channel? 

Some examples of noisy communication channels are: 

• an analogue telephone line, over which two modems communicate digital 
information; 

• the radio communication link from Galileo, the Jupiter-orbiting space- 
craft, to earth; 

• reproducing cells, in which the daughter cells' DNA contains information 
from the parent cells; 

• a disk drive. 

The last example shows that communication doesn't have to involve informa- 
tion going from one place to another. When we write a file on a disk drive, 
we'll read it off in the same location - but at a later time. 

These channels are noisy. A telephone line suffers from cross-talk with 
other lines; the hardware in the line distorts and adds noise to the transmitted 
signal. The deep space network that listens to Galileo's puny transmitter 
receives background radiation from terrestrial and cosmic sources. DNA is 
subject to mutations and damage. A disk drive, which writes a binary digit 
(a one or zero, also known as a bit) by aligning a patch of magnetic material 
in one of two orientations, may later fail to read out the stored binary digit: 
the patch of material might spontaneously flip magnetization, or a glitch of 
background noise might cause the reading circuit to report the wrong value 
for the binary digit, or the writing head might not induce the magnetization 
in the first place because of interference from neighbouring bits. 

In all these cases, if we transmit data, e.g., a string of bits, over the channel, 
there is some probability that the received message will not be identical to the 
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transmitted message. We would prefer to have a communication channel for 
which this probability was zero - or so close to zero that for practical purposes 
it is indistinguishable from zero. 

Let's consider a noisy disk drive that transmits each bit correctly with 
probability (1 — /) and incorrectly with probability /. This model communi- 
cation channel is known as the binary symmetric channel (figure 1 .4) . 



x°V% P(y=o\* = o) = i-/; P(y=o\x=i) = /; 
i^i 



P(y = l\ X = 0) 



/; 



P(y = l\x = l) 



Figure 1.4. The binary symmetric 
channel. The transmitted symbol 
j is x and the received symbol y. 

The noise level, the probability 
that a bit is flipped, is /. 





(i-/) 




Figure 1.5. A binary data 
sequence of length 10 000 
transmitted over a binary 
symmetric channel with noise 
level / = 0.1. [Dilbert image 
Copyright©1997 United Feature 
Syndicate, Inc., used with 
permission.] 



As an example, let's imagine that / = 0.1, that is, ten per cent of the bits are 
flipped (figure 1.5). A useful disk drive would flip no bits at all in its entire 
lifetime. If we expect to read and write a gigabyte per day for ten years, we 
require a bit error probability of the order of 10~ 15 , or smaller. There are two 
approaches to this goal. 



The physical solution 

The physical solution is to improve the physical characteristics of the commu- 
nication channel to reduce its error probability. We could improve our disk 
drive by 

1 . using more reliable components in its circuitry; 

2. evacuating the air from the disk enclosure so as to eliminate the turbu- 
lence that perturbs the reading head from the track; 

3. using a larger magnetic patch to represent each bit; or 

4. using higher-power signals or cooling the circuitry in order to reduce 
thermal noise. 

These physical modifications typically increase the cost of the communication 
channel. 



The 'system' solution 

Information theory and coding theory offer an alternative (and much more ex- 
citing) approach: we accept the given noisy channel as it is and add communi- 
cation systems to it so that we can detect and correct the errors introduced by 
the channel. As shown in figure 1.6, we add an encoder before the channel and 
a decoder after it. The encoder encodes the source message s into a transmit- 
ted message t, adding redundancy to the original message in some way. The 
channel adds noise to the transmitted message, yielding a received message r. 
The decoder uses the known redundancy introduced by the encoding system 
to infer both the original signal s and the added noise. 
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Source 



Encoder 



Decoder 



Noisy 
channel 



Figure 1.6. The 'system' solution 
for achieving reliable 
communication over a noisy 
channel. The encoding system 
introduces systematic redundancy 
into the transmitted vector t. The 
decoding system uses this known 
redundancy to deduce from the 
received vector r both the original 
source vector and the noise 
introduced by the channel. 



Whereas physical solutions give incremental channel improvements only at 
an ever-increasing cost, system solutions can turn noisy channels into reliable 
communication channels with the only cost being a computational requirement 
at the encoder and decoder. 

Information theory is concerned with the theoretical limitations and po- 
tentials of such systems. 'What is the best error-correcting performance we 
could achieve?' 

Coding theory is concerned with the creation of practical encoding and 
decoding systems. 



► 1.2 Error-correcting codes for the binary symmetric channel 

We now consider examples of encoding and decoding systems. What is the 
simplest way to add useful redundancy to a transmission? [To make the rules 
of the game clear: we want to be able to detect and correct errors; and re- 
transmission is not an option. We get only one chance to encode, transmit, 
and decode.] 



Repetition codes 

A straightforward idea is to repeat every bit of the message a prearranged 
number of times - for example, three times, as shown in table 1.7. We call 
this repetition code 'R3'. 

Imagine that we transmit the source message 

s = 1 1 1 

over a binary symmetric channel with noise level / = 0.1 using this repetition 
code. We can describe the channel as 'adding' a sparse noise vector n to the 
transmitted vector - adding in modulo 2 arithmetic, i.e., the binary algebra 
in which 1+1=0. A possible noise vector n and received vector r = t + n are 
shown in figure 1.8. 



Source 


Transmitted 


sequence 


sequence 


s 


t 





000 


1 


111 



Table 1.7. The repetition code R3. 





000 





000 
001 



111 
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000 
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10 1 



111 

000 





000 
000 



n 000 

r 000 001 111 000 010 111 000 



Figure 1.8. An example 
transmission using R 3 . 



How should we decode this received vector? The optimal algorithm looks 
at the received bits three at a time and takes a majority vote (algorithm 1.9). 
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Received sequence r 



Likelihood ratio ^ j s = o) 



Decoded sequence s 



000 


7 -3 





001 


T l 





010 


T l 





100 


T l 





101 


7 1 


1 


110 


7 1 
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011 


7 1 


1 


111 


7 3 


1 



Algorithm 1.9. Majority-vote 
decoding algorithm for R3. Also 
shown arc the likelihood ratios 
(1.23), assuming the channel is a 
binary symmetric channel; 

7 =(!-/)//■ 



At the risk of explaining the obvious, let's prove this result. The optimal 
decoding decision (optimal in the sense of having the smallest probability of 
being wrong) is to find which value of s is most probable, given r. Consider 
the decoding of a single bit s, which was encoded as t(s) and gave rise to three 
received bits r = r\r 2 rz. By Bayes' theorem, the posterior probability of s is 

, , n P(nr 2 r 3 1 s)P(s) 

P(rir 2 r 3 ) 

We can spell out the posterior probability of the two alternatives thus: 

P(s = l | nr 2 r 3 ) = — ; (1.19) 

P(. = 0|r 1 ^) = J>(ri ^'' = 0) f ( ' = 0) . (1.20) 
P{rir 2 r3) 

This posterior probability is determined by two factors: the prior probability 
P(s), and the data-dependent term P(rir 2 r 3 \ s), which is called the likelihood 
of s. The normalizing constant P(rir 2 r 3 ) needn't be computed when finding the 
optimal decoding decision, which is to guess s = if P(s = | r) > P(s = 1 1 r), 
and s = 1 otherwise. 

To find P(s = I r) and P(s = 1 1 r), we must make an assumption about the 
prior probabilities of the two hypotheses s = and s = 1, and we must make an 
assumption about the probability of r given s. We assume that the prior prob- 
abilities are equal: P(s = 0) = P(s = 1) = 0.5; then maximizing the posterior 
probability P(s \ r) is equivalent to maximizing the likelihood P(r \ s). And we 
assume that the channel is a binary symmetric channel with noise level / < 0.5, 
so that the likelihood is 

N 

P(v\s) = P(v\t(s))=l[P(r n \t n (s)), (1.21) 

n=l 

where N = 3 is the number of transmitted bits in the block we arc considering, 
and 

^M = {f- f) J r ; n = \i (1.22) 

Thus the likelihood ratio for the two hypotheses is 

P(r\s = l) = fr P(r n \tn(l)) . . . 

P{v\s = Q) l = lp(r n |t n (0))' 1 • > 

each factor p[r"jt"[o)j equals if r n = 1 and prjj) ^ r " = ^- The ratio 

7 = is greater than 1, since / < 0.5, so the winning hypothesis is the 

one with the most 'votes', each vote counting for a factor of 7 in the likelihood 
ratio. 
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Thus the majority- vote decoder shown in algorithm 1.9 is the optimal decoder 
if we assume that the channel is a binary symmetric channel and that the two 
possible source messages and 1 have equal prior probability. 



We now apply the majority vote decoder to the received vector of figure 1.8. 
The first three received bits are all 0, so we decode this triplet as a 0. In the 
second triplet of figure 1.8, there are two 0s and one 1, so we decode this triplet 
as a - which in this case corrects the error. Not all errors are corrected, 
however. If we are unlucky and two errors fall in a single block, as in the fifth 
triplet of figure 1.8, then the decoding rule gets the wrong answer, as shown 
in figure 1.10. 
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n 
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111 
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000 
000 
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1 



00_0 




01_0 




11J. 
1 



00_0 





Figure 1.10. Decoding the received 
vector from figure 1.8. 



Exercise 1.2.^' p l6 l Show that the error probability is reduced by the use of 
R,3 by computing the error probability of this code for a binary symmetric 
channel with noise level /. 

The error probability is dominated by the probability that two bits in 
a block of three are flipped, which scales as f 2 . In the case of the binary 
symmetric channel with / = 0.1, the R3 code has a probability of error, after 
decoding, of pb — 0.03 per bit. Figure 1.11 shows the result of transmitting a 
binary image over a binary symmetric channel using the repetition code. 



The exercise's rating, e.g.' [2]', 
indicates its difficulty: '1' 
exercises are the easiest. Exercises 
that are accompanied by a 
marginal rat are especially 
recommended. If a solution or 
partial solution is provided, the 
page is indicated after the 
difficulty rating; for example, this 
exercise's solution is on page 16. 



ENCODER 



CHANNEL 
/ = 10% 



DECODER 








Figure 1.11. Transmitting 10 000 
source bits over a binary 
symmetric channel with / = 10% 
using a repetition code and the 
majority vote decoding algorithm. 
The probability of decoded bit 
error has fallen to about 3%; the 
rate has fallen to 1/3. 
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Figure 1.12. Error probability pt 
versus rate for repetition codes 
over a binary symmetric channel 
with / = 0.1. The right-hand 
figure shows p b on a logarithmic 
scale. We would like the rate to 
be large and pb to be small. 



The repetition code R3 has therefore reduced the probability of error, as 
desired. Yet we have lost something: our rate of information transfer has 
fallen by a factor of three. So if we use a repetition code to communicate data 
over a telephone line, it will reduce the error frequency, but it will also reduce 
our communication rate. We will have to pay three times as much for each 
phone call. Similarly, we would need three of the original noisy gigabyte disk 
drives in order to create a one-gigabyte disk drive with = 0.03. 

Can we push the error probability lower, to the values required for a sell- 
able disk drive - 10~ 15 ? We could achieve lower error probabilities by using 
repetition codes with more repetitions. 

3 [3, p-16] ^ g now that the probability of error of Rjv, the repe- 
tition code with N repetitions, is 

Pb= £ PV(1 -/)""", (1-24) 

n=(JV+l)/2 ^ ' 

for odd N. 

Assuming / = 0.1, which of the terms in this sum is the biggest? 
How much bigger is it than the second-biggest term? 

Use Stirling's approximation (p. 2) to approximate the ( ) in the 
largest term, and find, approximately, the probability of error of 
the repetition code with N repetitions. 

Assuming / = 0.1, find how many repetitions are required to get 
the probability of error down to 10~ 15 . [Answer: about 60.] 

So to build a single gigabyte disk drive with the required reliability from noisy 
gigabyte drives with / = 0.1, we would need sixty of the noisy disk drives. 
The tradeoff between error probability and rate for repetition codes is shown 
in figure 1.12. 

Block codes - the (7, 4) Hamming code 

We would like to communicate with tiny probability of error and at a substan- 
tial rate. Can we improve on repetition codes? What if we add redundancy to 
blocks of data instead of encoding one bit at a time? We now study a simple 
block code. 



Exercise U 



(b) 
(c) 



(d) 
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1.2: Error-correcting codes for the binary symmetric channel 



A block code is a rule for converting a sequence of source bits s, of length 
K, say, into a transmitted sequence t of length N bits. To add redundancy, 
we make N greater than K. In a linear block code, the extra N — K bits are 
linear functions of the original K bits; these extra bits are called parity-check 
bits. An example of a linear block code is the (7, 4) Hamming code, which 
transmits N = 7 bits for every K = 4 source bits. 




Figure 1.13. Pictorial 
representation of encoding for the 
(7, 4) Hamming code. 



The encoding operation for the code is shown pictorially in figure 1.13. We 
arrange the seven transmitted bits in three intersecting circles. The first four 
transmitted bits, £1*2*3*4; are set equal to the four source bits, S1S2S3S4. The 
parity-check bits £5*6*7 are set so that the parity within each circle is even: 
the first parity-check bit is the parity of the first three source bits (that is, it 
is if the sum of those bits is even, and 1 if the sum is odd); the second is 
the parity of the last three; and the third parity bit is the parity of source bits 
one, three and four. 

As an example, figure 1.13b shows the transmitted codeword for the case 
s = 1000. Table 1.14 shows the codewords generated by each of the 2 4 = 
sixteen settings of the four source bits. These codewords have the special 
property that any pair differ from each other in at least three bits. 
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0000 


0000000 


0100 


0100110 
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1000101 


1100 


1100011 


0001 


0001011 


0101 


0101101 


1001 
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1101 
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0010111 


0110 


0110001 


1010 


1010010 


1110 


1110100 


0011 


0011100 


0111 


0111010 


1011 


1011001 


1111 


1111111 



Table 1.14. The sixteen codewords 
{t} of the (7, 4) Hamming code. 
Any pair of codewords differ from 
each other in at least three bits. 



Because the Hamming code is a linear code, it can be written compactly in 
terms of matrices as follows. The transmitted codeword t is obtained from the 
source sequence s by a linear operation, 



t = G T s, 

where G is the generator matrix of the code, 



G T 



(1.25) 



1 














1 














1 














1 


1 


1 


1 








1 


1 


1 


1 





1 


1 



(1.26) 



and the encoding operation (1.25) uses modulo-2 arithmetic (1 + 1 = 0,0 + 1 = 
1, etc.). 

In the encoding operation (1.25) I have assumed that s and t arc column vectors. 
If instead they are row vectors, then this equation is replaced by 



t = sG, 



(1.27) 
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where 



G = 



1 








1 









1 









1 



1 
1 
1 







1 
1 
1 



1 



1 
1 



(1.28) 



I find it easier to relate to the right-multiplication (1.25) than the left-multiplica- 
tion (1.27). Many coding theory texts use the left-multiplying conventions 
(1.27-1.28), however. 

The rows of the generator matrix (1.28) can be viewed as denning four basis 
vectors lying in a seven-dimensional binary space. The sixteen codewords are 
obtained by making all possible linear combinations of these vectors. 

Decoding the (7,4) Hamming code 

When we invent a more complex encoder s — > t, the task of decoding the 
received vector r becomes less straightforward. Remember that any of the 
bits may have been flipped, including the parity bits. 

If we assume that the channel is a binary symmetric channel and that all 
source vectors are equiprobable, then the optimal decoder identifies the source 
vector s whose encoding t(s) differs from the received vector r in the fewest 
bits. [Refer to the likelihood function (1.23) to see why this is so.] We could 
solve the decoding problem by measuring how far r is from each of the sixteen 
codewords in table 1.14, then picking the closest. Is there a more efficient way 
of finding the most probable source vector? 

Syndrome decoding for the Hamming code 

For the (7, 4) Hamming code there is a pictorial solution to the decoding 
problem, based on the encoding picture, figure 1.13. 

As a first example, let's assume the transmission was t = 1000101 and the 
noise flips the second bit, so the received vector is r = 1000101 © 0100000 = 
1100101. We write the received vector into the three circles as shown in 
figure 1.15a, and look at each of the three circles to see whether its parity 
is even. The circles whose parity is not even are shown by dashed lines in 
figure 1.15b. The decoding task is to find the smallest set of flipped bits that 
can account for these violations of the parity rules. [The pattern of violations 
of the parity checks is called the syndrome, and can be written as a binary 
vector - for example, in figure 1.15b, the syndrome is z = (1, 1,0), because 
the first two circles are 'unhappy' (parity 1) and the third circle is 'happy' 
(parity 0).] 

To solve the decoding task, we ask the question: can we find a unique bit 
that lies inside all the 'unhappy' circles and outside all the 'happy' circles? If 
so, the flipping of that bit would account for the observed syndrome. In the 
case shown in figure 1.15b, the bit T2 lies inside the two unhappy circles and 
outside the happy circle; no other single bit has this property, so r2 is the only 
single bit capable of explaining the syndrome. 

Let's work through a couple more examples. Figure 1.15c shows what 
happens if one of the parity bits, £5, is flipped by the noise. Just one of the 
checks is violated. Only r$ lies inside this unhappy circle and outside the other 
two happy circles, so r§ is identified as the only single bit capable of explaining 
the syndrome. 

If the central bit r% is received flipped, figure 1.15d shows that all three 
checks are violated; only r 3 lies inside all three circles, so r 3 is identified as 
the suspect bit. 
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Figure 1.15. Pictorial 
representation of decoding of the 
Hamming (7, 4) code. The 
received vector is written into the 
diagram as shown in (a). In 
(b,c,d,e), the received vector is 
shown, assuming that the 
transmitted vector was as in 
figure 1.13b and the bits labelled 
by * were flipped. The violated 
parity checks are highlighted by 
dashed circles. One of the seven 
bits is the most probable suspect 
to account for each 'syndrome', 
i.e., each pattern of violated and 
satisfied parity checks. 
In examples (b), (c), and (d), the 
most probable suspect is the one 
bit that was flipped. 
In example (e), two bits have been 
flipped, s 3 and tj. The most 
probable suspect is r'2, marked by 
a circle in (c'), which shows the 
output of the decoding algorithm. 
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If you try nipping any one of the seven bits, you'll find that a different 
syndrome is obtained in each case - seven non-zero syndromes, one for each 
bit. There is only one other syndrome, the all-zero syndrome. So if the 
channel is a binary symmetric channel with a small noise level /, the optimal 
decoder unflips at most one bit, depending on the syndrome, as shown in 
algorithm 1.16. Each syndrome could have been caused by other noise patterns 
too, but any other noise pattern that has the same syndrome must be less 
probable because it involves a larger number of noise events. 

What happens if the noise actually flips more than one bit? Figure 1.15e 
shows the situation when two bits, r% and rr, are received flipped. The syn- 
drome, 110, makes us suspect the single bit r% so our optimal decoding al- 
gorithm flips this bit, giving a decoded pattern with three errors as shown 
in figure 1.15e'. If we use the optimal decoding algorithm, any two-bit error 
pattern will lead to a decoded seven-bit vector that contains three errors. 



Algorithm 1.16. Actions taken by 
the optimal decoder for the (7, 4) 
Hamming code, assuming a 
binary symmetric channel with 
small noise level /. The syndrome 
vector z lists whether each parity 
check is violated (l) or satisfied 
(0), going through the checks in 
the order of the bits r 5 , r 6 , and r 7 . 



General view of decoding for linear codes: syndrome decoding 

We can also describe the decoding problem for a linear code in terms of matrices. 
The first four received bits, r 1 r 2 r 3 r4, purport to be the four source bits; and the 
received bits r^T^ri purport to be the parities of the source bits, as defined by 
the generator matrix G. We evaluate the three parity-check bits for the received 
bits, rxrirzTA, and see whether they match the three received bits, r^r^r-i. The 
differences (modulo 2) between these two triplets are called the syndrome of the 
received vector. If the syndrome is zero - if all three parity checks are happy 
- then the received vector is a codeword, and the most probable decoding is 
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given by reading out its first four bits. If the syndrome is non-zero, then the 
noise sequence for this block was non-zero, and the syndrome is our pointer to 
the most probable error pattern. 

The computation of the syndrome vector is a linear operation. If we define the 
3x4 matrix P such that the matrix of equation (1.26) is 



G T ; 



I 4 
P 



(1.29) 



where I4 is the 4x4 identity matrix, then the syndrome vector is z = Hr, 
where the parity-check matrix H is given by H = [ — P I3 ] ; in modulo 2 
arithmetic, — 1 = 1, so 



H = [ P I 3 



1110 10 
1110 10 
10 110 1 



(1.30) 



All the codewords t = G T s of the code satisfy 



Ht 



(1.31) 



Exercise 1.4.^ Prove that this is so by evaluating the 3x4 matrix HG T . 

Since the received vector r is given by r = G T s + n, the syndrome-decoding 
problem is to find the most probable noise vector n satisfying the equation 



Hn 



(1.32) 



A decoding algorithm that solves this problem is called a maximum-likelihood 
decoder. We will discuss decoding problems like this in later chapters. 



Figure 1.17. Transmitting 10 000 
source bits over a binary 
symmetric channel with / = 10% 
using a (7, 4) Hamming code. The 
probability of decoded bit error is 
about 7%. 



Summary of the (7,4) Hamming code's properties 

Every possible received vector of length 7 bits is either a codeword, or it's one 
flip away from a codeword. 

Since there are three parity constraints, each of which might or might not 
be violated, there are 2x2x2 = 8 distinct syndromes. They can be divided 
into seven non-zero syndromes - one for each of the one-bit error patterns - 
and the all-zero syndrome, corresponding to the zero-noise case. 

The optimal decoder takes no action if the syndrome is zero, otherwise it 
uses this mapping of non-zero syndromes onto one-bit error patterns to unffip 
the suspect bit. 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 



1.2: Error-correcting codes for the binary symmetric channel 



There is a decoding error if the four decoded bits 81,82,83,84 do not all 
match the source bits si, S2, S3, S4. The probability of block error pb is the 
probability that one or more of the decoded bits in one block fail to match the 
corresponding source bits, 

p B = P(sj±s). (1.33) 

The probability of bit error pb is the average probability that a decoded bit 
fails to match the corresponding source bit, 



Pb = ]^Y J P{sk^s k ). (1.34) 

k=l 

In the case of the Hamming code, a decoding error will occur whenever 
the noise has nipped more than one bit in a block of seven. The probability 
of block error is thus the probability that two or more bits are flipped in a 
block. This probability scales as 0(f 2 ), as did the probability of error for the 
repetition code R3. But notice that the Hamming code communicates at a 
greater rate, R = 4/7. 

Figure 1.17 shows a binary image transmitted over a binary symmetric 
channel using the (7,4) Hamming code. About 7% of the decoded bits are 
in error. Notice that the errors are correlated: often two or three successive 
decoded bits are flipped. 

Exercise 1.5. ^ This exercise and the next three refer to the (7,4) Hamming 
code. Decode the received strings: 

(a) r = 1101011 

(b) r = 0110110 

(c) r = 0100111 

(d) r = 1111111. 

Exercise 1.6 J 2 ' p ' 17 ^ (a) Calculate the probability of block error pe of the 
(7, 4) Hamming code as a function of the noise level / and show 
that to leading order it goes as 21 f 2 . 

(b) ^ Show that to leading order the probability of bit error pb g° es 
as 9/ 2 . 

Exercise 1.1 p l9 l Find some noise vectors that give the all-zero syndrome 
(that is, noise vectors that leave all the parity checks unviolated). How 
many such noise vectors are there? 

> Exercise 1.8. ^ I asserted above that a block decoding error will result when- 
ever two or more bits are flipped in a single block. Show that this is 
indeed so. [In principle, there might be error patterns that, after de- 
coding, led only to the corruption of the parity bits, with no source bits 
incorrectly decoded.] 

Summary of codes ' performances 

Figure 1.18 shows the performance of repetition codes and the Hamming code. 
It also shows the performance of a family of linear block codes that are gen- 
eralizations of Hamming codes, called BCH codes. 

This figure shows that we can, using linear block codes, achieve better 
performance than repetition codes; but the asymptotic situation still looks 
grim. 
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Figure 1.18. Error probability pb 
versus rate R for repetition codes, 
the (7, 4) Hamming code and 
BCH codes with blocklcngths up 
to 1023 over a binary symmetric 
channel with / = 0.1. The 
righthand figure shows Pb on a 
logarithmic scale. 



Exercise 1.9.^' p l9 -l Design an error-correcting code and a decoding algorithm 
for it, estimate its probability of error, and add it to figure 1.18. [Don't 
worry if you find it difficult to make a code better than the Hamming 
code, or if you find it difficult to find a good decoder for your code; that's 
the point of this exercise.] 

• Exercise 1.10. [3 ' p ' 20] A (7,4) Hamming code can correct any one error; might 

there be a (14,8) code that can correct any two errors? 

Optional extra: Does the answer to this question depend on whether the 
code is linear or nonlinear? 

• ' Exercise 1.11."' p ' 21 ] Design an error-correcting code, other than a repetition 
«J5C code, that can correct any two errors in a block of size N. 



► 1.3 What performance can the best codes achieve? 

There seems to be a trade-off between the decoded bit-error probability pb 
(which we would like to reduce) and the rate R (which we would like to keep 
large). How can this trade-off be characterized? What points in the {R,Pb) 
plane are achievable? This question was addressed by Claude Shannon in his 
pioneering paper of 1948, in which he both created the field of information 
theory and solved most of its fundamental problems. 

At that time there was a widespread belief that the boundary between 
achievable and nonachievable points in the (R,p-^) plane was a curve passing 
through the origin (R,pb) = (0,0); if this were so, then, in order to achieve 
a vanishingly small error probability pb, one would have to reduce the rate 
correspondingly close to zero. 'No pain, no gain.' 

However, Shannon proved the remarkable result that the boundary be- 
tween achievable and nonachievable points meets the R axis at a non-zero 
value R = C, as shown in figure 1.19. For any channel, there exist codes that 
make it possible to communicate with arbitrarily small probability of error pb 
at non-zero rates. The first half of this book (Parts I— III) will be devoted to 
understanding this remarkable result, which is called the noisy-channel coding 
theorem. 



Example: f = 0.1 

The maximum rate at which communication is possible with arbitrarily small 
Pb is called the capacity of the channel. The formula for the capacity of a 
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Figure 1.19. Shannon's 
noisy-channel coding theorem. 
The solid curve shows the 
Shannon limit on achievable 
values of (R,Ph) for the binary 
symmetric channel with / = 0.1. 
Rates up to R = C are achievable 
with arbitrarily small pb • The 
points show the performance of 
some textbook codes, as in 
figure 1.18. 

The equation defining the 
Shannon limit (the solid curve) is 
R = C/(l - H 2 {pb)), where C and 
H 2 are defined in equation (1.35). 



Rate Rate 



binary symmetric channel with noise level / is 



C(f) = 1 - H 2 (f) = 1 



/log 2 ^ + (l-/)log 2 



/ v " ~ 1 - / 



(1.35) 



the channel we were discussing earlier with noise level / = 0.1 has capacity 
C ~ 0.53. Let us consider what this means in terms of noisy disk drives. The 
repetition code R3 could communicate over this channel with pb = 0.03 at a 
rate R = 1/3. Thus we know how to build a single gigabyte disk drive with 
Pb = 0.03 from three noisy gigabyte disk drives. We also know how to make a 
single gigabyte disk drive with pb — 10~ 15 from sixty noisy one-gigabyte drives 
(exercise 1.3, p. 8). And now Shannon passes by, notices us juggling with disk 
drives and codes and says: 

'What performance are you trying to achieve? 10~ 15 ? You don't 
need sixty disk drives - you can get that performance with just 
two disk drives (since 1/2 is less than 0.53). And if you want 
Ph = 10~ 18 or 10~ 24 or anything, you can get there with two disk 
drives too!' 



[Strictly, the above statements might not be quite right, since, as we shall see, 
Shannon proved his noisy-channel coding theorem by studying sequences of 
block codes with ever-increasing blocklengths, and the required blocklength 
might be bigger than a gigabyte (the size of our disk drive), in which case, 
Shannon might say 'well, you can't do it with those tiny disk drives, but if you 
had two noisy terabyte drives, you could make a single high-quality terabyte 
drive from them'.] 



► 1.4 Summary 

The (7, 4) Hamming Code 

By including three parity-check bits in a block of 7 bits it is possible to detect 
and correct any single bit error in each block. 



Shannon's noisy-channel coding theorem 

Information can be communicated over a noisy channel at a non-zero rate with 
arbitrarily small error probability. 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 

16 1 — Introduction to Information Theory 



Information theory addresses both the limitations and the possibilities of 
communication. The noisy-channel coding theorem, which we will prove in 
Chapter 10, asserts both that reliable communication at any rate beyond the 
capacity is impossible, and that reliable communication at all rates up to 
capacity is possible. 

The next few chapters lay the foundations for this result by discussing 
how to measure information content and the intimately related topic of data 
compression. 



► 1.5 Further exercises 

> Exercise 1.12.^' p ' 21 ^ Consider the repetition code Rg. One way of viewing 
this code is as a concatenation of R 3 with R 3 . We first encode the 
source stream with R3, then encode the resulting output with R3. We 
could call this code 'R3'. This idea motivates an alternative decoding 
algorithm, in which we decode the bits three at a time using the decoder 
for R 3 ; then decode the decoded bits from that first decoder using the 
decoder for R3. 

Evaluate the probability of error for this decoder and compare it with 
the probability of error for the optimal decoder for Rg. 

Do the concatenated encoder and decoder for R§ have advantages over 
those for Rg? 



► 1.6 Solutions 

Solution to exercise 1.2 (p. 7). An error is made by R3 if two or more bits are 
flipped in a block of three. So the error probability of R3 is a sum of two 
terms: the probability that all three bits are flipped, / 3 ; and the probability 
that exactly two bits are flipped, 3/ 2 (l — /). [If these expressions are not 
obvious, see example 1.1 (p.l): the expressions are P(r = 3\f,N = S) and 
P(r = 2\f,N = 3).} 

Pb = PB = 3/ 2 (l - /) + f = 3/ 2 - 2/ 3 . (1.36) 

This probability is dominated for small / by the term 3/ 2 . 

See exercise 2.38 (p. 39) for further discussion of this problem. 

Solution to exercise 1.3 (p.8). The probability of error for the repetition code 
Rjv is dominated by the probability that \N/2~\ bits are flipped, which goes 

(for odd TV") as Notation: \N/2] denotes the 

N \f(N+l)/2,-, t \(N-l)/2 ,,,m smallest integer greater than or 

\N/2]J* V--J) ■ V- 37 ) equal to N/2. 

The term (^) can be approximated using the binary entropy function: 

1 2 NH 2 (K/N) < ( N \ < 2 NH 2 (K/N) ^ ( N \ _ 2 NH 2 (K/N) q 3g) 

N+l ~ \KJ ~ \KJ y ' ' 

where this approximation introduces an error of order \/N - as shown in 
equation (1.17). So 

Ph = PB ~ 2 N (f(l - f)) N l 2 = (4/(1 - f)) N '\ (1.39) 

Setting this equal to the required value of 10~ 15 we find TV ~ 2 k j°^° 1 _^ = 68. 
This answer is a little out because the approximation we used overestimated 
(^) and we did not distinguish between \N/2\ and N/2. 
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A slightly more careful answer (short of explicit computation) goes as follows. 
Taking the approximation for ( K ) to the next order, we find: 

{N/h 2N 7mi- (L40) 

This approximation can be proved from an accurate version of Stirling's ap- 
proximation (1.12), or by considering the binomial distribution with p = 1/2 
and noting 



1 



N/2 



where a = \J N/4, from which equation (1.40) follows. The distinction between 
fiV/2] and N/2 is not important in this term since (^) has a maximum at 
K = N/2. 

Then the probability of error (for odd N) is to leading order 



Pb 



((iv+i)/ 2 ) /<Ar+1)/2(1 - /)(JV " )/2 (L42) 

* 2^^2= / [/(l - f)]^ 1 )/ 2 * _^/[4/(l-/)]^-^(l.43) 
y/nN/2 ^/tyN/8 

The equation pb = 10~ 15 can be written 

logl0- 15 +log^2 

(N-l)/2~ r-£ (1.44) 

V " log4/(l-/) V ' 

which may be solved for N iteratively, the first iteration starting from N\ =68: 
{N 2 - l)/2 ~ "^q^' 7 = 29.9 iV 2 ~ 60.9 (1.45) 
This answer is found to be stable, so iV ~ 61 is the blocklcngth at which 

p b ~ io- 15 . 

Solution to exercise 1.6 (p. 13). 

(a) The probability of block error of the Hamming code is a sum of six terms 
- the probabilities that 2, 3, 4, 5, 6, or 7 errors occur in one block. 



E CV r(1 - /)7 ~ r - (1 - 46) 



7 

Pb = ' 

r=2 

To leading order, this goes as 

PB ^ Q/ 2 = 21/ 2 . (1.47) 

(b) The probability of bit error of the Hamming code is smaller than the 
probability of block error because a block error rarely corrupts all bits in 
the decoded block. The leading-order behaviour is found by considering 
the outcome in the most probable case where the noise vector has weight 
two. The decoder will erroneously flip a third bit, so that the modified 
received vector (of length 7) differs in three bits from the transmitted 
vector. That means, if we average over all seven bits, the probability that 
a randomly chosen bit is flipped is 3/7 times the block error probability, 
to leading order. Now, what we really care about is the probability that 
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a source bit is nipped. Are parity bits or source bits more likely to be 
among these three flipped bits, or are all seven bits equally likely to be 
corrupted when the noise vector has weight two? The Hamming code 
is in fact completely symmetric in the protection it affords to the seven 
bits (assuming a binary symmetric channel). [This symmetry can be 
proved by showing that the role of a parity bit can be exchanged with 
a source bit and the resulting code is still a (7, 4) Hamming code; see 
below.] The probability that any one bit ends up corrupted is the same 
for all seven bits. So the probability of bit error (for the source bits) is 
simply three sevenths of the probability of block error. 



Pb ^ TjPB ^ 9/ • 



(1.48) 



Symmetry of the Hamming (7, 4) code 



To prove that the (7, 4) code protects all bits equally, we start from the parity- 
check matrix 

"1110100" 
H = 1110 10 . (1.49) 
10 110 1 

The symmetry among the seven transmitted bits will be easiest to see if we 
reorder the seven bits using the permutation (t 1*2*3*4*5*6*7) (*5*2*3*4*1*6*7)- 
Then we can rewrite H thus: 



H 



1110 10 
1110 10 
1110 1 



(1.50) 



Now, if we take any two parity constraints that t satisfies and add them 
together, we get another parity constraint. For example, row 1 asserts *5 + 
*2 + *3 + *1 = even, and row 2 asserts t 2 + *3 + *4 + *6 = even, and the sum of 
these two constraints is 



t 5 + 2t 2 + 2t 3 + *i + * 4 + * 6 = even; 



(1.51) 



we can drop the terms 2t 2 and 2*3, since they are even whatever t 2 and *3 are; 
thus we have derived the parity constraint *5 + *i + *4 + *6 = even, which we 
can if we wish add into the parity-check matrix as a fourth row. [The set of 
vectors satisfying Ht = will not be changed.] We thus define 



H' 



1110 10 
1110 10 
1110 1 
10 1110 



(1.52) 



The fourth row is the sum (modulo two) of the top two rows. Notice that the 
second, third, and fourth rows are all cyclic shifts of the top row. If, having 
added the fourth redundant constraint, we drop the first constraint, we obtain 
a new parity-check matrix H", 



H" 



1110 10 
1110 1 
10 1110 



(1.53) 



which still satisfies H"t = for all codewords, and which looks just like 
the starting H in (1.50), except that all the columns have shifted along one 
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to the right, and the rightmost column has reappeared at the left (a cyclic 
permutation of the columns). 

This establishes the symmetry among the seven bits. Iterating the above 
procedure five more times, we can make a total of seven different H matrices 
for the same original code, each of which assigns each bit to a different role. 

We may also construct the super-redundant seven-row parity-check matrix 
for the code, 

"1110 10 
1110 10 
1110 1 
H" = 1 1 1 1 . (1.54) 
10 111 
10 10 11 
110 10 1 

This matrix is 'redundant' in the sense that the space spanned by its rows is 
only three-dimensional, not seven. 

This matrix is also a cyclic matrix. Every row is a cyclic permutation of 
the top row. 

Cyclic codes: if there is an ordering of the bits t\ . . An such that a linear 
code has a cyclic parity-check matrix, then the code is called a cyclic 
code. 

The codewords of such a code also have cyclic properties: any cyclic 
permutation of a codeword is a codeword. 

For example, the Hamming (7, 4) code, with its bits ordered as above, 
consists of all seven cyclic shifts of the codewords 1110100 and 1011000, 
and the codewords 0000000 and 1111111. 

Cyclic codes are a cornerstone of the algebraic approach to error-correcting 
codes. We won't use them again in this book, however, as they have been 
superceded by sparse-graph codes (Part VI). 

Solution to exercise 1.7 (p. 13). There are fifteen non-zero noise vectors which 
give the all-zero syndrome; these are precisely the fifteen non-zero codewords 
of the Hamming code. Notice that because the Hamming code is linear, the 
sum of any two codewords is a codeword. 



Graphs corresponding to codes 

Solution to exercise 1.9 (p. 14). When answering this question, you will prob- 
ably find that it is easier to invent new codes than to find optimal decoders 
for them. There are many ways to design codes, and what follows is just one 
possible train of thought. We make a linear block code that is similar to the 
(7, 4) Hamming code, but bigger. 

Many codes can be conveniently expressed in terms of graphs. In fig- 
ure 1.13, we introduced a pictorial representation of the (7,4) Hamming code. 
If we replace that figure's big circles, each of which shows that the parity of 
four particular bits is even, by a 'parity-check node' that is connected to the 
four bits, then we obtain the representation of the (7, 4) Hamming code by a 
bipartite graph as shown in figure 1.20. The 7 circles are the 7 transmitted 
bits. The 3 squares are the parity-check nodes (not to be confused with the 
3 parity-check bits, which are the three most peripheral circles). The graph 
is a 'bipartite' graph because its nodes fall into two classes - bits and checks 




Figure 1.20. The graph of the 
(7, 4) Hamming code. The 7 
circles are the bit nodes and the 3 
squares are the parity-check 
nodes. 
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- and there are edges only between nodes in different classes. The graph and 
the code's parity-check matrix (1.30) are simply related to each other: each 
parity-check node corresponds to a row of H and each bit node corresponds to 
a column of H; for every 1 in H, there is an edge between the corresponding 
pair of nodes. 

Having noticed this connection between linear codes and graphs, one way 
to invent linear codes is simply to think of a bipartite graph. For example, 
a pretty bipartite graph can be obtained from a dodecahedron by calling the 
vertices of the dodecahedron the parity-check nodes, and putting a transmitted 
bit on each edge in the dodecahedron. This construction defines a parity- 
check matrix in which every column has weight 2 and every row has weight 3. 
[The weight of a binary vector is the number of Is it contains.] 

This code has N = 30 bits, and it appears to have M apparen t = 20 parity- 
check constraints. Actually, there are only M = 19 independent constraints; 
the 20th constraint is redundant (that is, if 19 constraints are satisfied, then 
the 20th is automatically satisfied); so the number of source bits is K = 
N - M =11. The code is a (30, 11) code. 

It is hard to find a decoding algorithm for this code, but we can estimate 
its probability of error by finding its lowest weight codewords. If we flip all 
the bits surrounding one face of the original dodecahedron, then all the parity 
checks will be satisfied; so the code has 12 codewords of weight 5, one for each 
face. Since the lowest-weight codewords have weight 5, we say that the code 
has distance d = 5; the (7, 4) Hamming code had distance 3 and could correct 
all single bit-flip errors. A code with distance 5 can correct all double bit-flip 
errors, but there are some triple bit-flip errors that it cannot correct. So the 
error probability of this code, assuming a binary symmetric channel, will be 
dominated, at least for low noise levels /, by a term of order f 3 , perhaps 
something like 



Of course, there is no obligation to make codes whose graphs can be rep- 
resented on a plane, as this one can; the best linear codes, which have simple 
graphical descriptions, have graphs that are more tangled, as illustrated by 
the tiny (16,4) code of figure 1.22. 

Furthermore, there is no reason for sticking to linear codes; indeed some 
nonlinear codes - codes whose codewords cannot be defined by a linear equa- 
tion like Ht = have very good properties. But the encoding and decoding 
of a nonlinear code are even trickier tasks. 

Solution to exercise 1.10 (p. 14). First let's assume we are making a linear 
code and decoding it with syndrome decoding. If there are N transmitted 
bits, then the number of possible error patterns of weight up to two is 



For N = 14, that's 91 + 14+1 = 106 patterns. Now, every distinguishable 
error pattern must give rise to a distinct syndrome; and the syndrome is a 
list of M bits, so the maximum possible number of syndromes is 2 M . For a 
(14,8) code, M = 6, so there are at most 2 6 = 64 syndromes. The number of 
possible error patterns of weight up to two, 106, is bigger than the number of 
syndromes, 64, so we can immediately rule out the possibility that there is a 
(14,8) code that is 2-error-correcting. 




Figure 1.21. The graph defining 
the (30, 11) dodecahedron code. 
The circles are the 30 transmitted 
bits and the triangles are the 20 
parity checks. One parity check is 
redundant. 




Figure 1.22. Graph of a rate-V4 
low-density parity-check code 
(Gallager code) with blocklength 
N = 16, and M = 12 parity-check 
constraints. Each white circle 
represents a transmitted bit. Each 
bit participates in j = 3 
constraints, represented by 
squares. The edges between nodes 
were placed at random. (See 
Chapter 47 for more.) 





(1.56) 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 



1.6: Solutions 



The same counting argument works fine for nonlinear codes too. When 
the decoder receives r = t + n, his aim is to deduce both t and n from r. If 
it is the case that the sender can select any transmission t from a code of size 
St, and the channel can select any noise vector from a set of size S n , and those 
two selections can be recovered from the received bit string r, which is one of 
at most 2 N possible strings, then it must be the case that 

S t S n < 2 N . (1.57) 

So, for a (N,K) two-error-correcting code, whether linear or nonlinear, 



2 K 



N\ (N\ (N 
2 + 1 + 



< 2 N . (1.58) 



Solution to exercise 1.11 (p. 14). There are various strategies for making codes 
that can correct multiple errors, and I strongly recommend you think out one 
or two of them for yourself. 

If your approach uses a linear code, e.g., one with a collection of M parity 
checks, it is helpful to bear in mind the counting argument given in the previous 
exercise, in order to anticipate how many parity checks, M, you might need. 

Examples of codes that can correct any two errors are the (30, 11) dodec- 
ahedron code in the previous solution, and the (15,6) pentagonful code to be 
introduced on p. 221. Further simple ideas for making codes that can correct 
multiple errors from codes that can correct only one error are discussed in 
section 13.7. 



Solution to exercise 1.12 (p. 16). The probability of error of R| is, to leading 
order, 

Pb(R§) ~ 3 [ Pb (R 3 )] 2 = 3(3/ 2 ) 2 + • • • = 27/ 4 + • • • , (1.59) 

whereas the probability of error of R9 is dominated by the probability of five 
flips, 

Pb (R 9 ) ~ Q f(l - ft * 126/ 5 + • • • . (1.60) 

The R| decoding procedure is therefore suboptimal, since there are noise vec- 
tors of weight four that cause it to make a decoding error. 

It has the advantage, however, of requiring smaller computational re- 
sources: only memorization of three bits, and counting up to three, rather 
than counting up to nine. 

This simple code illustrates an important concept. Concatenated codes 
are widely used in practice because concatenation allows large codes to be 
implemented using simple encoding and decoding hardware. Some of the best 
known practical codes are concatenated codes. 
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Probability, Entropy, and Inference 



This chapter, and its sibling, Chapter 8, devote some time to notation. Just 
as the White Knight distinguished between the song, the name of the song, 
and what the name of the song was called (Carroll, 1998), we will sometimes 
need to be careful to distinguish between a random variable, the value of the 
random variable, and the proposition that asserts that the random variable 
has a particular value. In any particular chapter, however, I will use the most 
simple and friendly notation possible, at the risk of upsetting pure-minded 
readers. For example, if something is 'true with probability 1', I will usually 
simply say that it is 'true'. 

► 2.1 Probabilities and ensembles 

An ensemble X is a triple (x, Ax > Px), where the outcome x is the value 
of a random variable, which takes on one of a set of possible values, 
Ax = {ai,a 2 , . . . ,o», . . . ,o/}, having probabilities V x = {pi,P2, ■ ■ ■ ,Pi}, 
with P(x = a,i) = Pi, Pi>0 and J2 ai eA x p ( x = a i) = 1- 

The name A is mnemonic for 'alphabet'. One example of an ensemble is a 
letter that is randomly selected from an English document. This ensemble is 
shown in figure 2.1. There are twenty-seven possible letters: a-z, and a space 
character '-'. 

Abbreviations. Briefer notation will sometimes be used. For example, 
P(x = cij) may be written as P(aj) or P(x). 



Probability of a subset. If T is a subset of Ax then: 

P{T) = P{xeT)= Y j P{x = a i ). 



(2.1) 



For example, if we define V to be vowels from figure 2.1, V 
{a, e, i, o, u}, then 



P(V) = 0.06 + 0.09 + 0.06 + 0.07 + 0.03 = 0.31. 



(2.2) 



A joint ensemble XY is an ensemble in which each outcome is an ordered 
pair x, y with x € Ax = Wi, ■ ■ ■ , a-i} and y e Ay = {&i, • • • , bj}. 

We call P(x, y) the joint probability of x and y. 

Commas are optional when writing ordered pairs, so xy x, y. 

N.B. In a joint ensemble XY the two variables are not necessarily inde- 
pendent. 
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Figure 2.1. Probability 
distribution over the 27 outcomes 
for a randomly selected letter in 
an English language document 
(estimated from The Frequently 
Asked Questions Manual for 
Linux). The picture shows the 
probabilities by the areas of white 
squares. 
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2.1: Probabilities and ensembles 




a b c d e f g h i j k 1 m n o p q r s t u v w x y z - y 



Marginal probability. We can obtain the marginal probability P(x) from 
the joint probability P(x, y) by summation: 

P{x = ai ) = P(x = Oi,y). (2.3) 

yEAy 

Similarly, using briefer notation, the marginal probability of y is: 

P(y) = P(?,V)- (2-4) 

x£Ax 

Conditional probability 

P(x = a i \y = b j )^ P{x =^ y = ) bj) if P(y = bj)^0. (2.5) 

[If P(y = bj) = then P(x = a, | y = bj) is undefined.] 

We pronounce P(x = cii \ y = bj) 'the probability that x equals aj, given 
y equals bj'. 

Example 2.1. An example of a joint ensemble is the ordered pair XY consisting 
of two successive letters in an English document. The possible outcomes 
are ordered pairs such as aa, ab, ac, and zz; of these, we might expect 
ab and ac to be more probable than aa and zz. An estimate of the 
joint probability distribution for two neighbouring characters is shown 
graphically in figure 2.2. 

This joint ensemble has the special property that its two marginal dis- 
tributions, P(x) and P(y), are identical. They are both equal to the 
monogram distribution shown in figure 2.1. 

From this joint ensemble P(x, y) we can obtain conditional distributions, 
P(y | x) and P(x \y), by normalizing the rows and columns, respectively 
(figure 2.3). The probability P(y\x = q) is the probability distribution 
of the second letter given that the first letter is a q. As you can see in 
figure 2.3a, the two most probable values for the second letter y given 
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Figure 2.2. The probability 
distribution over the 27x27 
possible bigrams xy in an English 
language document, The 
Frequently Asked Questions 
Manual for Linux. 
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abcdefghijklmnopqrstuvwxyz 

(a) P(y\x) 




Figure 2.3. Conditional 
probability distributions, (a) 
P(y | x): Each row shows the 
conditional distribution of the 
second letter, y, given the first 
letter, x, in a bigram xy. (b) 
P(x | y): Each column shows the 
conditional distribution of the 
first letter, x, given the second 
letter, y. 



abcdefghijklmnopqrstuvwxyz- y 

(b) P(x | y) 



that the first letter x is q are u and -. (The space is common after q 
because the source document makes heavy use of the word FAQ.) 

The probability P{x\y = M) is the probability distribution of the first 
letter x given that the second letter y is a u. As you can see in figure 2.3b 
the two most probable values for x given y = u are n and o. 

Rather than writing down the joint probability directly, we often define an 
ensemble in terms of a collection of conditional probabilities. The following 
rules of probability theory will be useful. (H denotes assumptions on which 
the probabilities are based.) 

Product rule - obtained from the definition of conditional probability: 

P(x, y\H) = P(x\ y, H)P(y | H) = P(y | x, H)P(x | H). (2.6) 
This rule is also known as the chain rule. 
Sum rule - a rewriting of the marginal probability definition: 

P(x\H) = £)P(a:,y|W) (2.7) 
y 

= ^P(x|y,W)PQ/|W). (2.8) 

y 



Bayes' theorem - obtained from the product rule: 



P(x\y,H)P(y\H) 
Ej,' P{x\y',H)P{y'\ny 



(2.10) 



Independence. Two random variables X and Y are independent (sometimes 
written X1Y) if and only if 

P(x,y) = P(x)P(y). (2.11) 

Exercise 2.2.^' P' 4 ^ Are the random variables X and Y in the joint ensemble 
of figure 2.2 independent? 
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2.2: The meaning of probability 



I said that we often define an ensemble in terms of a collection of condi- 
tional probabilities. The following example illustrates this idea. 

Example 2.3. Jo has a test for a nasty disease. We denote Jo's state of health 
by the variable o and the test result by b. 

a = 1 Jo has the disease , > 

a = Jo does not have the disease. 

The result of the test is either 'positive' (b = 1) or 'negative' (b = 0); 
the test is 95% reliable: in 95% of cases of people who really have the 
disease, a positive result is returned, and in 95% of cases of people who 
do not have the disease, a negative result is obtained. The final piece of 
background information is that 1% of people of Jo's age and background 
have the disease. 

OK - Jo has the test, and the result is positive. What is the probability 
that Jo has the disease? 

Solution. We write down all the provided probabilities. The test reliability 
specifies the conditional probability of b given a: 

P(b=l\a=l) = 0.95 P(b = l \a = 0) = 0.05 . . 

P(b = 0\a=l) = 0.05 P(b = 0\a = 0) = 0.95; ( ' 

and the disease prevalence tells us about the marginal probability of a: 

P(o = l)=0.01 P( a = 0) = 0.99. (2.14) 

From the marginal P(a) and the conditional probability P(b \ a) we can deduce 
the joint probability P(a, b) = P(a)P(b | a) and any other probabilities we are 
interested in. For example, by the sum rule, the marginal probability of 6 = 1 
- the probability of getting a positive result - is 

P(b=l) = P(b = l\a = l)P(a = l) +P{b=l\a = 0)P{a = 0). (2.15) 

Jo has received a positive result 6=1 and is interested in how plausible it is 
that she has the disease (i.e., that a=l). The man in the street might be 
duped by the statement 'the test is 95% reliable, so Jo's positive result implies 
that there is a 95% chance that Jo has the disease', but this is incorrect. The 
correct solution to an inference problem is found using Bayes' theorem. 

P{b=l\a=l)P{a=l) 

P(a-l\b-l) - p(6= I | a=1 )p( a=1 ) + P (b=l \a=0)P{a=0) (2 ' 16) 

0-95 x 0.01 

0.95 x 0.01 + 0.05 x 0.99 1 ' ' 

= 0.16. (2.18) 

So in spite of the positive result, the probability that Jo has the disease is only 
16%. □ 



► 2.2 The meaning of probability 

Probabilities can be used in two ways. 

Probabilities can describe frequencies of outcomes in random experiments, 
but giving noncircular definitions of the terms 'frequency' and 'random' is a 
challenge - what does it mean to say that the frequency of a tossed coin's 
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Notation. Let 'the degree of belief in proposition x' be denoted by B(x). The 
negation of x (not-x) is written x. The degree of belief in a condi- 
tional proposition, 'x, assuming proposition y to be true', is represented 
by B(x | y). 

Axiom 1. Degrees of belief can be ordered; if B(x) is 'greater' than B(y), and 
B(y) is 'greater' than B(z), then B(x) is 'greater' than B(z). 

[Consequence: beliefs can be mapped onto real numbers.] 

Axiom 2. The degree of belief in a proposition x and its negation x are related. 
There is a function / such that 

B(x) = f[B(x)\. 

Axiom 3. The degree of belief in a conjunction of propositions x, y (x AND y) is 
related to the degree of belief in the conditional proposition x \ y and the 
degree of belief in the proposition y. There is a function g such that 

B(x,y)=g[B(x\y),B(y)}. 



Box 2.4. The Cox axioms. 
If a set of beliefs satisfy these 
axioms then they can be mapped 
onto probabilities satisfying 
P(false) = 0, P(true) = 1, 
< P(x) < 1, and the rules of 
probability: 

P(x) = 1 - P(x), 

and 

P(x,y) = P(x\y)P(y). 



coming up heads is V2? If we say that this frequency is the average fraction of 
heads in long sequences, we have to define 'average'; and it is hard to define 
'average' without using a word synonymous to probability! I will not attempt 
to cut this philosophical knot. 

Probabilities can also be used, more generally, to describe degrees of be- 
lief in propositions that do not involve random variables - for example 'the 
probability that Mr. S. was the murderer of Mrs. S., given the evidence' (he 
either was or wasn't, and it's the jury's job to assess how probable it is that he 
was); 'the probability that Thomas Jefferson had a child by one of his slaves'; 
'the probability that Shakespeare's plays were written by Francis Bacon'; or, 
to pick a modern-day example, 'the probability that a particular signature on 
a particular cheque is genuine'. 

The man in the street is happy to use probabilities in both these ways, but 
some books on probability restrict probabilities to refer only to frequencies of 
outcomes in repeatable random experiments. 

Nevertheless, degrees of belief can be mapped onto probabilities if they sat- 
isfy simple consistency rules known as the Cox axioms (Cox, 1946) (figure 2.4). 
Thus probabilities can be used to describe assumptions, and to describe in- 
ferences given those assumptions. The rules of probability ensure that if two 
people make the same assumptions and receive the same data then they will 
draw identical conclusions. This more general use of probability to quantify 
beliefs is known as the Bayesian viewpoint. It is also known as the subjective 
interpretation of probability, since the probabilities depend on assumptions. 
Advocates of a Bayesian approach to data modelling and pattern recognition 
do not view this subjectivity as a defect, since in their view, 



you cannot do inference without making assumptions. 



In this book it will from time to time be taken for granted that a Bayesian 
approach makes sense, but the reader is warned that this is not yet a globally 
held view - the field of statistics was dominated for most of the 20th century 
by non-Bayesian methods in which probabilities are allowed to describe only 
random variables. The big difference between the two approaches is that 
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2.3: Forward probabilities and inverse probabilities 
Bayesians also use probabilities to describe inferences. 

► 2.3 Forward probabilities and inverse probabilities 

Probability calculations often fall into one of two categories: forward prob- 
ability and inverse probability. Here is an example of a forward probability 
problem: 

Exercise 2.4.^' p ' 40 l An urn contains K balls, of which B are black and W = 
K — B are white. Fred draws a ball at random from the urn and replaces 
it, N times. 

(a) What is the probability distribution of the number of times a black 
ball is drawn, n^? 

(b) What is the expectation of ngl What is the variance of n^? What 
is the standard deviation of n^? Give numerical answers for the 
cases N = 5 and TV = 400, when B = 2 and K = 10. 

Forward probability problems involve a generative model that describes a pro- 
cess that is assumed to give rise to some data; the task is to compute the 
probability distribution or expectation of some quantity that depends on the 
data. Here is another example of a forward probability problem: 

Exercise 2.5 \ Z ' p ' 40 ^ An urn contains K balls, of which B are black and W = 
K — B are white. We define the fraction fg = B/K. Fred draws N 
times from the urn, exactly as in exercise 2.4, obtaining ng blacks, and 
computes the quantity 

z {ng-fgNf 

Z ~ Nfg(l-fg)- ^ 19) 

What is the expectation of zl In the case N = 5 and fg = 1/5, what 
is the probability distribution of zl What is the probability that z < 11 
[Hint: compare z with the quantities computed in the previous exercise.] 

Like forward probability problems, inverse probability problems involve a 
generative model of a process, but instead of computing the probability distri- 
bution of some quantity produced by the process, we compute the conditional 
probability of one or more of the unobserved variables in the process, given 
the observed variables. This invariably requires the use of Bayes' theorem. 

Example 2.6. There are eleven urns labelled by u € {0, 1,2,..., 10}, each con- 
taining ten balls. Urn u contains u black balls and 10 — u white balls. 
Fred selects an urn u at random and draws N times with replacement 
from that urn, obtaining ng blacks and N — ng whites. Fred's friend, 
Bill, looks on. If after N = 10 draws ng = 3 blacks have been drawn, 
what is the probability that the urn Fred is using is urn u, from Bill's 
point of view? (Bill doesn't know the value of u.) 

Solution. The joint probability distribution of the random variables u and ng 
can be written 

P{u,n B | AT) = P{n B \u,N)P(u). (2.20) 

From the joint probability of u and ng, we can obtain the conditional 
distribution of u given ng: 

P(n B \u,N)P(u) 

P(ng\N) ■ ^ 
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Figure 2.5. Joint probability of u 
and lis for Bill and Fred's urn 
problem, after N = f0 draws. 



0123456789 10 n B 



The marginal probability of u is P(u) = jj for all u. You wrote down the 
probability of n B given u and N, P(n B \ u, N), when you solved exercise 2.4 
(p. 27). [You are doing the highly recommended exercises, aren't you?] If we 
define f u = u/10 then 



P(n B \u,N) = 



N 
tie 



N~n B 



(2.23) 



What about the denominator, P(n B \ N)? This is the marginal probability of 
riB, which we can obtain using the sum rule: 

P(n B | N) = P(u, n B \N) = J2 P{u)P{n B \u,N). (2.24) 



So the conditional probability of u given n B is 

P{u)P{n B \u,N) 



P(u\n B ,N) 



P(n B | N) 
1 1 / N 



P(n B \N) 11 \n B 



/ra-/u) 



N-n B 



(2.25) 
(2.26) 



This conditional distribution can be found by normalizing column 3 of 
figure 2.5 and is shown in figure 2.6. The normalizing constant, the marginal 
probability of ns, is P{ns = 3 | N= 10) = 0.083. The posterior probability 
(2.26) is correct for all u, including the end-points w = and u=10, where 
f u = and f u = 1 respectively. The posterior probability that u = given 
ns = 3 is equal to zero, because if Fred were drawing from urn it would be 
impossible for any black balls to be drawn. The posterior probability that 
u=10 is also zero, because there are no white balls in that urn. The other 
hypotheses u=l, u = 2, ... u = 9 all have non-zero posterior probability. □ 
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Figure 2.6. Conditional 
probability of u given n B = 3 and 
AT = 10. 



Terminology of inverse probability 

In inverse probability problems it is convenient to give names to the proba- 
bilities appearing in Bayes' theorem. In equation (2.25), we call the marginal 
probability P(u) the prior probability of u, and P(hb \ u, N) is called the like- 
lihood of u. It is important to note that the terms likelihood and probability 
are not synonyms. The quantity P{ns \u,N) is a function of both ns and 
u. For fixed u, P(ns\u,N) defines a probability over n B - For fixed ns, 
P(ub I u, N) defines the likelihood of u. 
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2.3: Forward probabilities and inverse probabilities 



Never say 'the likelihood of the data'. Always say 'the likelihood 
of the parameters'. The likelihood function is not a probability 
distribution. 



(If you want to mention the data that a likelihood function is associated with, 
you may say 'the likelihood of the parameters given the data'.) 

The conditional probability P(u \ub, N) is called the posterior probability 
of u given ns- The normalizing constant P(iib \ N) has no w-dependence so its 
value is not important if we simply wish to evaluate the relative probabilities 
of the alternative hypotheses u. However, in most data-modelling problems 
of any complexity, this quantity becomes important, and it is given various 
names: P(ub \ N) is known as the evidence or the marginal likelihood. 

If denotes the unknown parameters, D denotes the data, and Tt denotes 
the overall hypothesis space, the general equation: 

PW n,n ) = W§^M ( , 27) 

is written: 

likelihood x prior , 

posterior = . (2.28) 

evidence 

Inverse probability and prediction 

Example 2.6 (continued). Assuming again that Bill has observed «b = 3 blacks 
in N = 10 draws, let Fred draw another ball from the same urn. What 
is the probability that the next drawn ball is a black? [You should make 
use of the posterior probabilities in figure 2.6.] 

Solution. By the sum rule, 
P(baliAr + i is black | ng, N) = ^ P(ballAr + i is black | u, ub, N)P(u | ng, N). 

u 

(2.29) 

Since the balls are drawn with replacement from the chosen urn, the proba- 
bility P(balljv + i is black | u, ub, N) is just f u = u/10, whatever ns and N are. 
So 

P(bal\ N+1 is black | n B , N) = ^ f u P(u \ n B , N). (2.30) 

u 

Using the values of P{u \ ns, N) given in figure 2.6 we obtain 

P(ballAr+i is black | n B = 3,N =10) = 0.333. □ (2.31) 



Comment. Notice the difference between this prediction obtained using prob- 
ability theory, and the widespread practice in statistics of making predictions 
by first selecting the most plausible hypothesis (which here would be that the 
urn is urn u = 3) and then making the predictions assuming that hypothesis 
to be true (which would give a probability of 0.3 that the next ball is black). 
The correct prediction is the one that takes into account the uncertainty by 
marginalizing over the possible values of the hypothesis u. Marginalization 
here leads to slightly more moderate, less extreme predictions. 
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Inference as inverse probability 

Now consider the following exercise, which has the character of a simple sci- 
entific investigation. 

Example 2.7. Bill tosses a bent coin N times, obtaining a sequence of heads 
and tails. We assume that the coin has a probability fjj of coming up 
heads; we do not know If nu heads have occurred in N tosses, what 
is the probability distribution of /#? (For example, N might be 10, and 
nu might be 3; or, after a lot more tossing, we might have N = 300 and 
nn = 29.) What is the probability that the jV+lth outcome will be a 
head, given njj heads in N tosses? 

Unlike example 2.6 (p. 27), this problem has a subjective element. Given a 
restricted definition of probability that says 'probabilities are the frequencies 
of random variables', this example is different from the eleven-urns example. 
Whereas the urn u was a random variable, the bias fu of the coin would not 
normally be called a random variable. It is just a fixed but unknown parameter 
that we are interested in. Yet don't the two examples 2.6 and 2.7 seem to have 
an essential similarity? [Especially when N = 10 and n# = 3!] 

To solve example 2.7, we have to make an assumption about what the bias 
of the coin fjj might be. This prior probability distribution over fjj, P(fu), 
corresponds to the prior over u in the eleven-urns problem. In that example, 
the helpful problem definition specified P(u). In real life, we have to make 
assumptions in order to assign priors; these assumptions will be subjective, 
and our answers will depend on them. Exactly the same can be said for the 
other probabilities in our generative model too. We are assuming, for example, 
that the balls are drawn from an urn independently; but could there not be 
correlations in the sequence because Fred's ball-drawing action is not perfectly 
random? Indeed there could be, so the likelihood function that we use depends 
on assumptions too. In real data modelling problems, priors are subjective and 
so are likelihoods. 

We are now using P() to denote probability densities over continuous vari- 
ables as well as probabilities over discrete variables and probabilities of logical 
propositions. The probability that a continuous variable v lies between values 
a and b (where b > a) is denned to be J dv P(v). P(v)dv is dimensionless. 
The density P(v) is a dimensional quantity, having dimensions inverse to the 
dimensions of v - in contrast to discrete probabilities, which are dimensionless. 
Don't be surprised to see probability densities greater than 1. This is normal, 
and nothing is wrong, as long as f dv P(v) < 1 for any interval (a,b). 

Conditional and joint probability densities are defined in just the same way as 
conditional and joint probabilities. 

> Exercise 2.8. ^ Assuming a uniform prior on fjj, P(/h) = T solve the problem 
posed in example 2.7 (p. 30). Sketch the posterior distribution of fjj and 
compute the probability that the iV+lth outcome will be a head, for 



(a) N = 3 and n# = 0; 

(b) N = 3 and n H = 2; 

(c) TV = 10 and n H = 3; 

(d) N = 300 and n H = 29. 

You will find the beta integral useful: 



Here P(f) denotes a probability 
density, rather than a probability 
distribution. 




F b _ 



T(F a + l)T{F b +l) 
T(F a + F b + 2) 



(F a + F b + iy: 



FJF b \ 



(2.32) 



Jo 
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2.3: Forward probabilities and inverse probabilities 31 

You may also find it instructive to look back at example 2.6 (p.27) and 
equation (2.31). 

People sometimes confuse assigning a prior distribution to an unknown pa- 
rameter such as fjj with making an initial guess of the value of the parameter. 
But the prior over fu, P(Jh), is not a simple statement like 'initially, I would 
guess fjj = V2'. The prior is a probability density over fjj which specifies the 
prior degree of belief that fjj lies in any interval (/, / + 5f). It may well be 
the case that our prior for fjj is symmetric about V2, so that the mean of fu 
under the prior is V2. In this case, the predictive distribution for the first toss 
x\ would indeed be 

P(.T 1= head) = Jdf H P(f H )P( Xl =head\f H ) = Jdf H P(f H )f H = V2. 

(2.33) 

But the prediction for subsequent tosses will depend on the whole prior dis- 
tribution, not just its mean. 

Data compression and inverse probability 
Consider the following task. 

Example 2.9. Write a computer program capable of compressing binary files 
like this one: 

0000000000000000000010010001000000100000010000000000000000000000000000000000001010000000000000110000 
1000000000010000100000000010000000000000000000000100000000000000000100000000011000001000000011000100 
0000000001001000000000010001000000000000000011000000000000000000000000000010000000000000000100000000 

The string shown contains n\ = 29 Is and n = 271 0s. 

Intuitively, compression works by taking advantage of the predictability of a 
file. In this case, the source of the file appears more likely to emit 0s than 
Is. A data compression program that compresses this file must, implicitly or 
explicitly, be addressing the question 'What is the probability that the next 
character in this file is a 1?' 

Do you think this problem is similar in character to example 2.7 (p. 30)? 
I do. One of the themes of this book is that data compression and data 
modelling are one and the same, and that they should both be addressed, like 
the urn of example 2.6, using inverse probability. Example 2.9 is solved in 
Chapter 6. 

The likelihood principle 
Please solve the following two exercises. 

Example 2.10. Urn A contains three balls: one black, and two white; urn B 
contains three balls: two black, and one white. One of the urns is 
selected at random and one ball is drawn. The ball is black. What is 
the probability that the selected urn is urn A? 

Example 2.11. Urn A contains five balls: one black, two white, one green and 
one pink; urn B contains five hundred balls: two hundred black, one 
hundred white, 50 yellow, 40 cyan, 30 sienna, 25 green, 25 silver, 20 
gold, and 10 purple. [One fifth of A's balls are black; two-fifths of B's 
are black.] One of the urns is selected at random and one ball is drawn. 
The ball is black. What is the probability that the urn is urn A? 



A 




B 


•OO 




• •O 



Figure 2.7. Urns for example 2.10. 
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Figure 2.8. Urns for example 2.11. 
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What do you notice about your solutions? Does each answer depend on the 
detailed contents of each urn? 

The details of the other possible outcomes and their probabilities are ir- 
relevant. All that matters is the probability of the outcome that actually 
happened (here, that the ball drawn was black) given the different hypothe- 
ses. We need only to know the likelihood, i.e., how the probability of the data 
that happened varies with the hypothesis. This simple rule about inference is 
known as the likelihood principle. 



The likelihood principle: given a generative model for data d given 
parameters 0, P(d\8), and having observed a particular outcome 
d\ , all inferences and predictions should depend only on the function 

p(di\e). 



In spite of the simplicity of this principle, many classical statistical methods 
violate it. 

► 2.4 Definition of entropy and related functions 

The Shannon information content of an outcome x is defined to be 



h(x) = log 2 



1 



P(x) 



(2.34) 



It is measured in bits. [The word 'bit' is also used to denote a variable 
whose value is or 1; I hope context will always make clear which of the 
two meanings is intended.] 

In the next few chapters, we will establish that the Shannon information 
content /i(oj) is indeed a natural measure of the information content 
of the event x = a,. At that point, we will shorten the name of this 
quantity to 'the information content'. 

The fourth column in table 2.9 shows the Shannon information content 
of the 27 possible outcomes when a random character is picked from 
an English document. The outcome x = z has a Shannon information 
content of 10.4 bits, and x = e has an information content of 3.5 bits. 

The entropy of an ensemble X is defined to be the average Shannon in- 
formation content of an outcome: 



H(X) 



P(x) log 



1 



xeAx 



(2.35) 



with the convention for P(x) = that Ox log 1/0 = 0, since 
]img_ +6 log l/6» = 0. 

Like the information content, entropy is measured in bits. 

When it is convenient, we may also write H(X) as H(p), where p is 
the vector (pi,P2, ■ ■ ■ ,Pi)- Another name for the entropy of X is the 
uncertainty of X. 



i 


a. 


Pi 


h[pi) 


1 


a 


.0575 


4.1 


2 


b 


.0128 


6.3 


3 


c 


.0263 


5.2 


4 


d 


.0285 


5.1 


5 


e 


.0913 


3.5 


6 


f 


.0173 


5.9 


7 


g 


.0133 


6.2 


8 


h 


.0313 


5.0 


9 


i 


.0599 


4.1 


10 


j 


.0006 


10.7 


11 


k 


.0084 


6.9 


12 


1 


.0335 


4.9 


13 


m 


.0235 


5.4 


14 


n 


.0596 


4.1 


15 





.0689 


3.9 


16 


P 


.0192 


5.7 


17 


q 


.0008 


10.3 


18 


r 


.0508 


4.3 


19 


s 


.0567 


4.1 


20 


t 


.0706 


3.8 


21 


u 


.0334 


4.9 


22 


V 


.0069 


7.2 


23 


w 


.0119 


6.4 


24 


X 


.0073 


7.1 


25 


y 


.0164 


5.9 


26 


z 


.0007 


10.4 


27 




.1928 


2.4 



yviog 2 — 

i Pi 



4.1 



Table 2.9. Shannon information 
contents of the outcomes a-z. 



Example 2.12. The entropy of a randomly selected letter in an English docu- 
ment is about 4.11 bits, assuming its probability is as given in table 2.9. 
We obtain this number by averaging logl/p^ (shown in the fourth col- 
umn) under the probability distribution pi (shown in the third column). 
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2.5: Decomposability of the entropy 



We now note some properties of the entropy function. 

• H(X) > with equality iff pi = 1 for one i. ['iff' means 'if and only if'.] 

• Entropy is maximized if p is uniform: 

H(X) < \og(\A x \) with equality iff pi = l/\A x \ for all i. (2.36) 

Notation: the vertical bars '| • |' have two meanings. If Ax is a set, \Ax\ 
denotes the number of elements in Ax', if x is a number, then \x\ is the 
absolute value of x. 

The redundancy measures the fractional difference between H(X) and its max- 
imum possible value, log(|.4x|)- 

The redundancy of X is: 

We won't make use of 'redundancy' in this book, so I have not assigned 
a symbol to it. 

The joint entropy of X,Y is: 

H(X,Y)= ]T P(.*,y)log-i-. (2.38) 

xyeAxAy 1 ' ' 

Entropy is additive for independent random variables: 

H(X, Y) = H(X) + H(Y) iff P(x, y) = P(x)P(y). (2.39) 

Our definitions for information content so far apply only to discrete probability 
distributions over finite sets Ax- The definitions can be extended to infinite 
sets, though the entropy may then be infinite. The case of a probability 
density over a continuous set is addressed in section 11.3. Further important 
definitions and exercises to do with entropy will come along in section 8.1. 



► 2.5 Decomposability of the entropy 

The entropy function satisfies a recursive property that can be very useful 
when computing entropies. For convenience, we'll stretch our notation so that 
we can write H(X) as H(p), where p is the probability vector associated with 
the ensemble X. 

Let's illustrate the property by an example first. Imagine that a random 
variable ai€{0,l,2}is created by first flipping a fair coin to determine whether 
x = 0; then, if x is not 0, flipping a fair coin a second time to determine whether 
a; is 1 or 2. The probability distribution of x is 

P(x = 0) = l -- P(x = l) = l -- P(x = 2) = 1 (2.40) 

What is the entropy of XI We can either compute it by brute force: 

H(X) = V2log2+ V4log4+ V4log4 = 1.5; (2.41) 

or we can use the following decomposition, in which the value of x is revealed 
gradually. Imagine first learning whether x = 0, and then, if x is not 0, learning 
which non-zero value is the case. The revelation of whether x = or not entails 
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revealing a binary variable whose probability distribution is {V2, V2}. This 
revelation has an entropy Hi}h, V2) = \ log 2 + \ log 2 = lbit. If x is not 0, 
we learn the value of the second coin flip. This too is a binary variable whose 
probability distribution is {V2, V2}, and whose entropy is lbit. We only get 
to experience the second revelation half the time, however, so the entropy can 
be written: 

H(X) = H(V2, 1/2) + V2 H(V2, 1/2). (2.42) 

Generalizing, the observation we are making about the entropy of any 
probability distribution p = {pi,P2, ■ ■ ■ ,Pi} is that 

ff(p) = H( P1 , 1- P1 ) + (l- Pl )lf (_??_, -B-, . . . , -EO . (2.43) 

When it's written as a formula, this property looks regrettably ugly; nev- 
ertheless it is a simple property and one that you should make use of. 
Generalizing further, the entropy has the property for any m that 

H(p) = H[(p 1 +p 2 -\ 1" Pm), (Pm+l + Pm+2 H h Pi)] 

Pi Pm 



+ {Pl+---+Pm)H 



-{Pm+i H Vpi)H 



(j>l + r-Pm)'""' G?l + YPm) 

Pm+l Pi 



(Pm+l H h Pi) (Pm+l H HP/), 

(2.44) 



Example 2.13. A source produces a character a; from the alphabet A = 
{0, 1, . . . , 9, a, b, . . . , z}; with probability V3, x is a numeral (0, . . . , 9); 
with probability V3, x is a vowel (a, e,i,o,u); and with probability V3 
it's one of the 21 consonants. All numerals are equiprobable, and the 
same goes for vowels and consonants. Estimate the entropy of X. 

Solution, log 3+ i(logl0 + log5 + log21) = log3+ ± log 1050 ~ log 30 bits. □ 



► 2.6 Gibbs' inequality 

The relative entropy or Kullback-Leibler divergence between two 
probability distributions P(x) and Q(x) that are defined over the same 
alphabet Ax is 

£KL(P||Q) = £n*)l°g§[4 (2-45) 

The relative entropy satisfies Gibbs' inequality 

Dkl(P\\Q)>0 (2.46) 

with equality only if P = Q. Note that in general the relative entropy 
is not symmetric under interchange of the distributions P and Q: in 
general D KL (P\\Q) 7^ Dkl{Q\\P), so -Dkl, although it is sometimes 
called the 'KL distance', is not strictly a distance. The relative entropy 
is important in pattern recognition and neural networks, as well as in 
information theory. 



The 'ei' in Leiblcr is pronounced 
the same as in heist. 



Gibbs' inequality is probably the most important inequality in this book. It, 
and many other inequalities, can be proved using the concept of convexity. 
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► 2.7 Jensen's inequality for convex functions 

The words 'convex^' and 'concave-—-' may be pronounced 'convex-smile' and 
'concave-frown'. This terminology has useful redundancy: while one may forget 
which way up 'convex' and 'concave' arc, it is harder to confuse a smile with a 
frown. 

Convex-—^ functions. A function f(x) is convex-—- over (a, b) if every chord 
of the function lies above the function, as shown in figure 2.10; that is, 
for all x\, X2 G (a, 6) and < A < 1, 



/(An + (1 - A)x 2 ) < A/(xi) + (1 - A)/(x 2 ). 



(2.47) 



A function / is strictly convex^ if, for all X\,X2 € (a, b), the equality 
holds only for A = and A = 1. 

Similar definitions apply to concave ^ and strictly concave functions. 

Some strictly convex ' functions are 

• x 2 , e x and e~ x for all x; 

• log(l/x) and xlogx for x > 0. 




X2 

Am + (1 — X)X2 



Figure 2.10. Definition of 
convexity. 




Figure 2.11. Convex-—- functions. 



Jensen's inequality. If / is a convex -~ ' function and £ is a random variable 
then: 

£ [f(x)] > f(£[x]) , (2.48) 

where £ denotes expectation. If / is strictly convex^-- and £ [f(x)\ = 
f(£[x]), then the random variable a: is a constant. 

Jensen's inequality can also be rewritten for a concave '— ■ function, with 
the direction of the inequality reversed. 

A physical version of Jensen's inequality runs as follows. 

If a collection of masses pi are placed on a convex - curve f(x) 
at locations (xi, f(xi)), then the centre of gravity of those masses, 
which is at (£[x],£ [f{x)]), lies above the curve. 

If this fails to convince you, then feel free to do the following exercise. 

Exercise 2.14.'- 2 ' p ' 41 ^ Prove Jensen's inequality. 

Example 2.15. Three squares have average area A = 100m 2 . The average of 
the lengths of their sides is I = 10 m. What can be said about the size 
of the largest of the three squares? [Use Jensen's inequality] 

Solution. Let x be the length of the side of a square, and let the probability 
of x be V3, V3, V3 over the three lengths l\,l2,h- Then the information that 
we have is that £ [x] = 10 and £ [f{x)} = 100, where f(x) = x 2 is the function 
mapping lengths to areas. This is a strictly convex--' function. We notice 
that the equality £ [f(x)\ = f(£ [x]) holds, therefore a: is a constant, and the 
three lengths must all be equal. The area of the largest square is 100 m 2 . □ 
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Convexity and concavity also relate to maximization 
If /(x) is concave-— ^ and there exists a point at which 

^- = for all k, (2.49) 

then /(x) has its maximum value at that point. 

The converse does not hold: if a concave ^ /(x) is maximized at some x 
it is not necessarily true that the gradient V/(x) is equal to zero there. For 
example, f(x) = —\x\ is maximized at x = where its derivative is undefined; 
and f{p) = log(p), for a probability p € (0, 1), is maximized on the boundary 
of the range, at p = 1, where the gradient df(p)/dp = 1. 



► 2.8 Exercises 



Sums of random variables 

Exercise 2.16.^' p ' 4 ^ (a) Two ordinary dice with faces labelled 1, ... ,6 are 
thrown. What is the probability distribution of the sum of the val- 
ues? What is the probability distribution of the absolute difference 
between the values? 

(b) One hundred ordinary dice are thrown. What, roughly, is the prob- 
ability distribution of the sum of the values? Sketch the probability 
distribution and estimate its mean and standard deviation. 

(c) How can two cubical dice be labelled using the numbers 
{0,1,2,3,4,5,6} so that when the two dice are thrown the sum 
has a uniform probability distribution over the integers 1-12? 

(d) Is there any way that one hundred dice could be labelled with inte- 
gers such that the probability distribution of the sum is uniform? 



i 



Inference problems 
Exercise 2.17. [2 ' p 41] If q = 1 



p and a = Inp/q, show that 
1 

P = 



1 + exp(— a) 



(2.50) 



Sketch this function and find its relationship to the hyperbolic tangent 
function tanh(w) = 



e"+e- 



It will be useful to be fluent in base-2 logarithms also. If b = \og 2 p/q, 
what is b as a function of pi 

t> Exercise 2.18.^' p ' 42 ^ Let x and y be dependent random variables with x a 
binary variable taking values in Ax = {0,1}- Use Bayes' theorem to 
show that the log posterior probability ratio for x given y is 



log 



P(x = l\y) 
P(x = 0\y) 



log 



P(y\x=l) 



log 



P(x = l) 



P(y\x=0) 6 P(x = 0)' 



(2.51) 



i> Exercise 2.19.^' p ' 42 ^ Let x, d\ and d 2 be random variables such that d\ and 
c?2 are conditionally independent given a binary variable x. Use Bayes' 
theorem to show that the posterior probability ratio for x given {di} is 



P(x=l\{dj}) _ P{d 1 \x=l) P{d 2 \x=l) P{x = l) 
P(x = 0\{di}) ~ Pjd[ \x = 0) P(d 2 \x = 0) P(x = 0)' 



(2.52) 



This exercise is intended to help 
you think about the central-limit 
theorem, which says that if 
independent random variables 
X\, x 2 , ■ ■ ■ , x N have means y, n and 
finite variances a\, then, in the 
limit of large N, the sum Y2 n x « 
has a distribution that tends to a 
normal (Gaussian) distribution 
with mean 2~^ n and variance 
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2.8: Exercises 

Life in high- dimensional spaces 

Probability distributions and volumes have some unexpected properties in 
high-dimensional spaces. 

Exercise 2.20.^' p ' 42 ^ Consider a sphere of radius r in an ./V-dimensional real 
space. Show that the fraction of the volume of the sphere that is in the 
surface shell lying at values of the radius between r — e and r, where 
< e < r, is: 

/=l-(l--) . (2.53) 

Evaluate / for the cases N = 2, N = 10 and N = 1000, with (a) e/r = 0.01; 
(b) e/r = 0.5. 

Implication: points that are uniformly distributed in a sphere in N di- 
mensions, where N is large, are very likely to be in a thin shell near the 
surface. 



Expectations and entropies 

You are probably familiar with the idea of computing the expectation of a 
function of x, 

£[nx)] = (f(x)) = Y,P(x)f(x). (2.54) 

X 

Maybe you are not so comfortable with computing this expectation in cases 
where the function f(x) depends on the probability P(x). The next few ex- 
amples address this concern. 

41 Exercise 2.21. [1 > p ' 43] Let p a = 0.1, p 6 = 0.2, and p c = 0.7. Let /(a) = 10, 
f(b) = 5, and /(c) = 10/7. What is £ [/(x)]? What is £ [l/P{x)]l 

Exercise 2.22. p ' 43 ] For an arbitrary ensemble, what is £ [1/P(x)]7 

> Exercise 2.23. [1 ' p ' 43] Let p a = 0.1, p b = 0.2, andp c = 0.7. Let g(a) =0, g(b) = 1, 
and g(c) = 0. What is £ [g{x)]l 

o Exercise 2.24. [1 ' p ' 43] Let p o = 0.1, p 6 = 0.2, and p c = 0.7. What is the proba- 
bility that P(x) e [0.15,0.5]? What is 



P{x) 

log 



> 0.05 ? 



0.2 

' Exercise 2.25. ' 5 ' p ' 43 ^ Prove the assertion that H(X) < log(|^x|) with equal- 
ity iff pi = 1/1-4x1 for ah *• {\Ax\ denotes the number of elements in 
the set -4x0 [Hint: use Jensen's inequality (2.48); if your first attempt 
to use Jensen does not succeed, remember that Jensen involves both a 
random variable and a function, and you have quite a lot of freedom in 
choosing these; think about whether your chosen function / should be 
convex or concave.] 

> Exercise 2.26^ 3 ' p ' 44 -' Prove that the relative entropy (equation (2.45)) satisfies 

Dkl(P\\Q) > (Gibbs' inequality) with equality only if P = Q. 

> Exercise 2.27. ^ Prove that the entropy is indeed decomposable as described 

in equations (2.43-2.44). 
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> Exercise 2.28. ^' p ' 45 -' A random variable x € {0, 1, 2, 3} is selected by nipping 

a bent coin with bias / to determine whether the outcome is in {0, 1} or 
{2, 3}; then either flipping a second bent coin with bias g or a third bent 
coin with bias h respectively. Write down the probability distribution 
of x. Use the decomposability of the entropy (2.44) to find the entropy 
of X. [Notice how compact an expression is obtained if you make use 
of the binary entropy function H2(x), compared with writing out the 
four-term entropy explicitly] Find the derivative of H{X) with respect 
to /. [Hint: dH 2 {x)/dx = log((l - x)/x)] 

> Exercise 2.29.^' p - 45 ^ An unbiased coin is flipped until one head is thrown. 

What is the entropy of the random variable x € {1,2,3,...}, the num- 
ber of flips? Repeat the calculation for the case of a biased coin with 
probability / of coming up heads. [Hint: solve the problem both directly 
and by using the decomposability of the entropy (2.43).] 



ffy 




3 



► 2.9 Further exercises 

Forward probability 

t> bxercise 2.30. [1] An urn contains w white balls and b black balls. Two balls 
are drawn, one after the other, without replacement. Prove that the 
probability that the first ball is white is equal to the probability that the 
second is white. 

> Exercise 2.31. ^ A circular coin of diameter a is thrown onto a square grid 

whose squares are 6x6. (a < 6) What is the probability that the coin 
will lie entirely within one square? [Ans: (1 — a/6) 2 ] 

> Exercise 2.32. ^ Buffon's needle. A needle of length a is thrown onto a plane 

covered with equally spaced parallel lines with separation 6. What is 
the probability that the needle will cross a line? [Ans, if a < b: ^ a liTb\ 
[Generalization - Buffon's noodle: on average, a random curve of length 
A is expected to intersect the lines 2A /nb times.] 

Exercise 2.33. ^ Two points are selected at random on a straight line segment 
of length 1 . What is the probability that a triangle can be constructed 
out of the three resulting segments? 

Exercise 2.34. p ' 45 ^ An unbiased coin is flipped until one head is thrown. 
What is the expected number of tails and the expected number of heads? 

Fred, who doesn't know that the coin is unbiased, estimates the bias 
using / = h/(h + t), where h and t are the numbers of heads and tails 
tossed. Compute and sketch the probability distribution of /. 

N.B., this is a forward probability problem, a sampling theory problem, 
not an inference problem. Don't use Bayes' theorem. 

Exercise 2.35. p ' 45 ^ Fred rolls an unbiased six-sided die once per second, not- 
ing the occasions when the outcome is a six. 

(a) What is the mean number of rolls from one six to the next six? 

(b) Between two rolls, the clock strikes one. What is the mean number 
of rolls until the next six? 
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(c) Now think back before the clock struck. What is the mean number 
of rolls, going back in time, until the most recent six? 

(d) What is the mean number of rolls from the six before the clock 
struck to the next six? 

(e) Is your answer to (d) different from your answer to (a)? Explain. 

Another version of this exercise refers to Fred waiting for a bus at a 
bus-stop in Poissonville where buses arrive independently at random (a 
Poisson process), with, on average, one bus every six minutes. What is 
the average wait for a bus, after Fred arrives at the stop? [6 minutes.] So 
what is the time between the two buses, the one that Fred just missed, 
and the one that he catches? [12 minutes.] Explain the apparent para- 
dox. Note the contrast with the situation in Clockville, where the buses 
are spaced exactly 6 minutes apart. There, as you can confirm, the mean 
wait at a bus-stop is 3 minutes, and the time between the missed bus 
and the next one is 6 minutes. 



Conditional probability 

> Exercise 2.36.^ You meet Fred. Fred tells you he has two brothers, Alf and 

Bob. 

What is the probability that Fred is older than Bob? 

Fred tells you that he is older than Alf. Now, what is the probability 
that Fred is older than Bob? (That is, what is the conditional probability 
that F > B given that F > A?) 

> Exercise 2.37. ^ The inhabitants of an island tell the truth one third of the 

time. They lie with probability 2/3. 

On an occasion, after one of them made a statement, you ask another 
'was that statement true?' and he says 'yes'. 

What is the probability that the statement was indeed true? 

> Exercise 2.38. p ' 46 ^ Compare two ways of computing the probability of error 

of the repetition code R3, assuming a binary symmetric channel (you 
did this once for exercise 1.2 (p. 7)) and confirm that they give the same 
answer. 

Binomial distribution method. Add the probability that all three 
bits are flipped to the probability that exactly two bits are flipped. 

Sum rule method. Using the sum rule, compute the marginal prob- 
ability that r takes on each of the eight possible values, P(r). 
[-P(r) = P(s)P(r I s).] Then compute the posterior probabil- 
ity of s for each of the eight values of r. [In fact, by symmetry, 
only two example cases r = (000) and r = (001) need be consid- 
ered.] Notice that some of the inferred bits are better determined Equation (1.18) gives the 
than others. From the posterior probability P(s | r) you can read posterior probability of the input 
out the case-by-case error probability, the probability that the more s ' gi vcn tne received vector r. 
probable hypothesis is not correct, P(error|r). Find the average 
error probability using the sum rule, 



P(error) = 2__, P(r).P(error | r). 

r 



(2.55) 
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> Exercise 2.39. p ' 46 -' The frequency p n of the nth most frequent word in 
English is roughly approximated by 

v „{ IT fornel,... ,12367 

Pn -\0 n> 12 367. (A5b) 

[This remarkable l/n law is known as Zipf 's law, and applies to the word 
frequencies of many languages (Zipf, 1949).] If we assume that English 
is generated by picking words at random according to this distribution, 
what is the entropy of English (per word)? [This calculation can be found 
in 'Prediction and entropy of printed English', C.E. Shannon, Bell Syst. 
Tech. J. 30, pp. 50-64 (1950), but, inexplicably, the great man made 
numerical errors in it.] 



► 2.10 Solutions 

Solution to exercise 2.2 (p. 24). No, they are not independent. If they were 
then all the conditional distributions P(y | x) would be identical functions of 
y, regardless of x (cf. figure 2.3). 

Solution to exercise 2.4 (p. 27). We define the fraction f B = B/K. 

(a) The number of black balls has a binomial distribution. 

P(n B \f B ,N)= (^^jfl B {l-f B ) N - nB . (2.57) 

(b) The mean and variance of this distribution are: 

£[n B ] = Nf B (2.58) 

var[n B ] = Nf B (l - f B ). (2.59) 

These results were derived in example 1.1 (p.l). The standard deviation 
of n B is ^var[n B ] = y/Nf B (l - 7b)- 

When B/K =1/5 and N = 5, the expectation and variance of n B are 1 
and 4/5. The standard deviation is 0.89. 

When B/K =1/5 and N = 400, the expectation and variance of n B are 
80 and 64. The standard deviation is 8. 

Solution to exercise 2.5 (p. 27). The numerator of the quantity 

= (ng - f B N f 
Z Nf B (l - f B ) 

can be recognized as (n B — £[n B }) 2 ; the denominator is equal to the variance 
of n B (2.59), which is by definition the expectation of the numerator. So the 
expectation of 2 is 1. [A random variable like z, which measures the deviation 
of data from the expected value, is sometimes called x 2 (chi-squared).] 

In the case N = 5 and f B = 1/5, Nf B is 1, and var[n#] is 4/5. The 
numerator has five possible values, only one of which is smaller than 1: (n B — 
f B N) 2 = has probability P(n B = l) = 0.4096; so the probability that z < 1 
is 0.4096. 
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Solution to exercise 2.14 (p. 35). We wish to prove, given the property 

f(Xxi + (l-X)x 2 ) < Xf(x 1 ) + (l-X)f(x 2 ) 1 (2.60) 
that, if ^2iPi = 1 and pi > 0, 



^Pif{xi) > f I ^PiX l J . 

i=l \i=l J 



(2.61) 



We proceed by recursion, working from the right-hand side. (This proof does 
not handle cases where some pi = 0; such details are left to the pedantic 
reader.) At the first line we use the definition of convexity (2.60) with A = 



--r 1 — = Pi\ at the second line, A 




< Pif(xi) + 



and so forth. 

Solution to exercise 2.16 (p. 36). 




(2.62) 



/ ^PiXi / ^Pi 



i=3 



i=3 



□ 



(a) For the outcomes {2, 3,4, 5, 6, 7, 8, 9, 10, 11, 12}, the probabilities are V = 
rj_ llililill J_i 

136' 36' 36' 36' 36' 36' 36' 36' 36' 36' 36J- 



(b) The value of one die has mean 3.5 and variance 35/12. So the sum of 
one hundred has mean 350 and variance 3500/12 ~ 292, and by the 
central-limit theorem the probability distribution is roughly Gaussian 
(but confined to the integers), with this mean and variance. 

(c) In order to obtain a sum that has a uniform distribution we have to start 
from random variables some of which have a spiky distribution with the 
probability mass concentrated at the extremes. The unique solution is 
to have one ordinary die and one with faces 6, 6, 6, 0, 0, 0. 

(d) Yes, a uniform distribution can be created in several ways, for example To think about: does this uniform 
by labelling the rth die with the numbers {0, 1, 2, 3, 4, 5} x 6 r . distribution contradict the 



central-limit theorem? 



Solution to exercise 2.17 (p. 36). 



i n ; 



p 

q 



and q - 



■ p gives 



P 
P 



e° + 1 1 + exp(-a) ' 



(2.63) 

(2.64) 
(2.65) 



The hyperbolic tangent is 



tanh(a) 



(2.66) 
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so 

1 1 /l - e- a 



/(«) 



1 + exp(-o) 2 \1 + e 
1 / e Q / 2 - e~ Q / 2 \ 1 



1 = -(tanh(o/2) + 1). (2.67) 



2 \ e a/2 + e -a/2 I 2 

In the case b = log 2 p/q, we can repeat steps (2.63-2.65), replacing e by 2, 
to obtain 

1 

P=-, 7T- ■ 2.68 

F 1 + 2-° y ' 

Solution to exercise 2.18 (p. 36). 

, x P(y\x)P(x) 
P(x\v) = y p^ y) (2-69) 



P(x=l\y) P(y\x=l) P(x=l) 

P(x = 0\y) P(y\ x = 0) P(x = 0) 



(2.70) 



P(x=l\y) , P( 2/ |.t=1) , P(x = l) , „ N 

log — ^ L^i = w — i + W — ^ i (2 71) 

^ IO8 P( a; = 0|j/) 8 P(2/|a; = 0) 8 P(x = 0) 1 ] 

Solution to exercise 2.19 (p. 36). The conditional independence of d\ and d 2 
given x means 

P(x, d 1 ,d 2 ) = P(x)P{d 1 | x)P(d 2 | x). (2.72) 

This gives a separation of the posterior probability ratio into a series of factors, 
one for each data point, times the prior probability ratio. 

P(x=l\{d l }) _ P({d i }\x=l)P(x = l) 



P(x = 0\{di}) P({di}\x=0) P{x = 0) 

P(d 1 \x = l) P{d 2 \x = l) P{x = l) 
P{di \x = Q) P(d 2 | x = 0) Pjx = 0) ' 



(2.74) 



Life in high- dimensional spaces 

Solution to exercise 2.20 (p. 37). The volume of a hypersphere of radius r in 
N dimensions is in fact 

tt/V/2 

V ^ = wm r ^ (2 - 75) 

but you don't need to know this. For this question all that we need is the 
r- dependence, V(r, N) oc r N . So the fractional volume in (r — e, r) is 

r /v_( r _ e )JV / exN 



The fractional volumes in the shells for the required cases are: 



N 


2 


10 


1000 


e/r = 0.01 


0.02 


0.096 


0.99996 


e/r = 0.5 


0.75 


0.999 


1 _ 2- 1000 



Notice that no matter how small e is, for large enough N essentially all the 
probability mass is in the surface shell of thickness e. 
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Solution to exercise 2.21 (p.37). p a = 0.1, Pb = 0.2, p c = 0.7. /(a) = 10, 
f(b) = 5, and /(c) = 10/7. 

£ lf(x)} = 0.1 x 10 + 0.2 x 5 + 0.7 x 10/7 = 3. (2.77) 

For each x, f(x) = 1/P(x), so 

£[1/P(x)]=£[f(x)}=3. (2.78) 

Solution to exercise 2.22 (p.37). For general X, 

£ [i/p(x)j = p(*n/p(x) = £ 1 = i^i- ( 2 - 79 ) 

Solution to exercise 2.23 (p.37). p o = 0.1, p b = 0.2, p c = 0.7. g(a) = 0, g{b) = l, 
and g(c) = 0. 

S[g{x)]=p b = 0.2. (2.80) 

Solution to exercise 2.24 (p.37). 

P (P(x) e [0.15, 0.5]) = p b = 0.2. (2.81) 
P(x) 



lOE 



0.2 



> 0.05 = Pa + Pc = 0.8. (2.82) 



Solution to exercise 2.25 (p.37). This type of question can be approached in 
two ways: either by differentiating the function to be maximized, finding the 
maximum, and proving it is a global maximum; this strategy is somewhat 
risky since it is possible for the maximum of a function to be at the boundary 
of the space, at a place where the derivative is not zero. Alternatively, a 
carefully chosen inequality can establish the answer. The second method is 
much neater. 

Proof by differentiation (not the recommended method). Since it is slightly 
easier to differentiate lnl/p than log 2 l/p, we temporarily define H(X) to be 
measured using natural logarithms, thus scaling it down by a factor of log 2 e. 

H(X) = 5>* ln ^ ( 2 - 83 ) 

™ = lni-1 (2.84) 
dpi Pi 

we maximize subject to the constraint J2iPi = 1 which can be enforced with 
a Lagrange multiplier: 



G(p) = H(X) + \(^2 Pi -lj 



(2.85) 



^ = Inl-l + A. (2.86) 

dpi Pi 

At a maximum, 

In— + A = (2.87) 

Pi 

^>ln— = 1-A, (2.88) 

Pi 

so all the pi are equal. That this extremum is indeed a maximum is established 
by finding the curvature: 

dpidpj pi 

which is negative definite. □ 
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Proof using Jensen's inequality (recommended method). First a reminder of 
the inequality. 

If / is a convex ' function and a; is a random variable then: 

£[f(x)]>f(£[x]). 

If / is strictly convex^-- and £ [f(xj\ = f (£{x]), then the random 
variable £ is a constant (with probability 1). 

The secret of a proof using Jensen's inequality is to choose the right func- 
tion and the right random variable. We could define 

f(u) = log - = - log u (2.90) 
u 

(which is a convex function) and think of H{X) = ^2pi log ^- as the mean of 
f(u) where u = P(x), but this would not get us there - it would give us an 
inequality in the wrong direction. If instead we define 

u=l/P{x) (2.91) 

then we find: 

H(X) = -£ [f(l/P(x))} < -f (£[1/P(x)}) ; (2.92) 
now we know from exercise 2.22 (p. 37) that £[1/P(x)} = \Ax\, so 

H(X)<-f(\A x \)=\og\Ax\- (2.93) 

Equality holds only if the random variable u = 1/P(x) is a constant, which 
means P (x) is a constant for all x. □ 

Solution to exercise 2.26 (p. 37). 

i?KL(P||Q) = E P ^)log^||. (2.94) 



We prove Gibbs' inequality using Jensen's inequality. Let f(u) = log 1/u and 

u = 2^1. Then 

P(x) 

Dkl(P\\Q) = £[f(Q(x)/P(x))} (2.95) 

>- '(Ef)|)->«yw)-* 

with equality only if u = is a constant, that is, if Q(x) = P{x). □ 

Second solution. In the above proof the expectations were with respect to 
the probability distribution P(x). A second solution method uses Jensen's 
inequality with Q(x) instead. We define f(u) = wlogw and let u = 
Then 

a< L (pii« - s »)Si„ g s= EQW/ (£gr) (2 . 97) 



>- /(Eow^) = /w = »- 



(2.98) 



with equality only if u = is a constant, that is, if Q(x) = P(x). □ 
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Solution to exercise 2.28 (p. 38). 

H(X) = H 2 (f) + fH 2 (g) + (1 - f)H 2 (h). 



(2.99) 



Solution to exercise 2.29 (p. 38). The probability that there are x — 1 tails and 
then one head (so we get the first head on the xth toss) is 



p{x) = (i - /r- 1 / 



(2.100) 



If the first toss is a tail, the probability distribution for the future looks just 
like it did before we made the first toss. Thus we have a recursive expression 
for the entropy: 

H(X) = H 2 (f) + (l-f)H(X). (2.101) 

Rearranging, 

H(X) = H 2 (f)/f. (2.102) 
Solution to exercise 2.34 (p. 38). The probability of the number of tails t is 



IV 1 



(2.103) 



The expected number of heads is 1 , by definition of the problem. The expected 
number of tails is 



4=0 



(2.104) 



which may be shown to be 1 in a variety of ways. For example, since the 
situation after one tail is thrown is equivalent to the opening situation, we can 
write down the recurrence relation 



1 



£[t] = -(!+£ 



1 



:0 



£[t] = 1. 



(2.105) 



The probability distribution of the 'estimator' / = 1/(1 + t), given that 
/ = 1/2, is plotted in figure 2.12. The probability of / is simply the probability 
of the corresponding value of t. 

Solution to exercise 2.35 (p. 38). 

(a) The mean number of rolls from one six to the next six is six (assuming 
we start counting rolls after the first of the two sixes). The probability 
that the next six occurs on the rth roll is the probability of not getting 
a six for r — 1 rolls multiplied by the probability of then getting a six: 



P(n = r) 



r-1 



, for re {1,2,3,...}. 



(2.106) 



This probability distribution of the number of rolls, r, may be called an 
exponential distribution, since 

P(n = r) = e- ar /Z, (2.107) 

where a = ln(6/5), and Z is a normalizing constant. 

(b) The mean number of rolls from the clock until the next six is six. 

(c) The mean number of rolls, going back in time, until the most recent six 
is six. 



0.5 - 

0.4 - 

0.3 - 

0.2 - 

0.1 - 

- 




P(f) 



Figure 2.12. The probability 
distribution of the estimator 
/= 1/(1 + t), given that/ = 
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(d) The mean number of rolls from the six before the clock struck to the six 
after the clock struck is the sum of the answers to (b) and (c), less one, 
that is, eleven. 

(e) Rather than explaining the difference between (a) and (d), let me give 
another hint. Imagine that the buses in Poissonville arrive indepen- 
dently at random (a Poisson process), with, on average, one bus every 
six minutes. Imagine that passengers turn up at bus-stops at a uniform 
rate, and are scooped up by the bus without delay, so the interval be- 
tween two buses remains constant. Buses that follow gaps bigger than 
six minutes become overcrowded. The passengers' representative com- 
plains that two-thirds of all passengers found themselves on overcrowded 
buses. The bus operator claims, 'no, no - only one third of our buses 
are overcrowded'. Can both these claims be true? 

Solution to exercise 2.38 (p. 39). 

Binomial distribution method. From the solution to exercise 1.2, p B = 
3/ 2 (l-/) + / 3 - 

Sum rule method. The marginal probabilities of the eight values of r are 
illustrated by: 

P(r = 000) = V2(l - /) 3 + V2/ 3 , (2.108) 
P(r = 001) = V2/(l - ff + l/2/ 2 (l - /) = V2/(l - /). (2.109) 
The posterior probabilities are represented by 

/ 3 



and 



P(s= 1 |r = 000) = 
P(s=l |r=001) = 



(l-/) 3 + / 3 
(I-/)/ 2 



/(l-/) 2 + / 2 (l-/) 
The probabilities of error in these representative cases are thus 

/ 3 



P(error | r = 000) 



(l"/) 3 + / 3 



(2.110) 
(2.111) 

(2.112) 



and 



P(error|r=001) = /. (2.113) 

Notice that while the average probability of error of R3 is about 3/ 2 , the 
probability (given r) that any particular bit is wrong is either about / 3 
or /. 

The average error probability, using the sum rule, is 



P(error) = P(r)P(error | r) 

r 

= 2[l/2(l - ff + V2/ 3 ] 



/ 3 



So 



P(error) 



(l-/) 3 + / 3 
/ 3 + 3/ 2 (l-/). 



■ 6[l/2/(l - /)]/. 




5 10 15 20 

Figure 2.13. The probability 
distribution of the number of rolls 
ri from one 6 to the next (falling 
solid line), 



5Y'- 1 1 



and the probability distribution 
(dashed line) of the number of 
rolls from the 6 before 1pm to the 
next 6, r tot , 



P(f tot = r) = r 



The probability P{j\ > 6) is 
about 1/3; the probability 
P(r tot > 6) is about 2/3. The 
mean of n is 6, and the mean of 
r to t is 11. 



The first two terms are for the 
cases r = 000 and 111; the 
remaining 6 are for the other 
outcomes, which share the same 
probability of occurring and 
identical error probability, /. 



Solution to exercise 2.39 (p. 40). The entropy is 9.7 bits per word. 
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About Chapter 3 



If you are eager to get on to information theory, data compression, and noisy 
channels, you can skip to Chapter 4. Data compression and data modelling 
are intimately connected, however, so you'll probably want to come back to 
this chapter by the time you get to Chapter 6. Before reading Chapter 3, it 
might be good to look at the following exercises. 

t> Exercise 3.1.^' P- 59 l A die is selected at random from two twenty-faced dice 
on which the symbols 1-10 are written with nonuniform frequency as 
follows. 



Symbol 


1 


2 


3 


4 


5 


6 


7 


8 9 


10 


Number of faces of die A 


6 


4 


3 


2 


1 


1 


1 


1 1 





Number of faces of die B 


3 


3 


2 


2 


2 


2 


2 


2 1 


1 



The randomly chosen die is rolled 7 times, with the following outcomes: 

5, 3, 9, 3, 8, 4, 7. 
What is the probability that the die is die A? 



> Exercise 3.2. ^ ' p ' ' Assume that there is a third twenty-faced die, die C, on 
which the symbols 1-20 are written once each. As above, one of the 
three dice is selected at random and rolled 7 times, giving the outcomes: 
3, 5, 4, 8, 3, 9, 7. 

What is the probability that the die is (a) die A, (b) die B, (c) die C? 

Exercise 3.3. p ' 48 -' Inferring a decay constant 

Unstable particles are emitted from a source and decay at a distance 
x, a real number that has an exponential probability distribution with 
characteristic length A. Decay events can only be observed if they occur 
in a window extending from x = 1cm to x = 20 cm. N decays are 
observed at locations {x\, . . . , xn}- What is A? 



* x * 

> Exercise 3.4. p ' 55 ] Forensic evidence 

Two people have left traces of their own blood at the scene of a crime. A 
suspect, Oliver, is tested and found to have type 'O' blood. The blood 
groups of the two traces are found to be of type 'O' (a common type 
in the local population, having frequency 60%) and of type 'AB' (a rare 
type, with frequency 1%). Do these data (type 'O' and 'AB' blood were 
found at scene) give evidence in favour of the proposition that Oliver 
was one of the two people present at the crime? 
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More about Inference 



It is not a controversial statement that Bayes' theorem provides the correct 
language for describing the inference of a message communicated over a noisy 
channel, as we used it in Chapter 1 (p. 6). But strangely, when it comes to 
other inference problems, the use of Bayes' theorem is not so widespread. 

► 3.1 A first inference problem 

When I was an undergraduate in Cambridge, I was privileged to receive su- 
pervisions from Steve Gull. Sitting at his desk in a dishevelled office in St. 
John's College, I asked him how one ought to answer an old Tripos question 
(exercise 3.3): 

Unstable particles are emitted from a source and decay at a 
distance x, a real number that has an exponential probability dis- 
tribution with characteristic length A. Decay events can only be 
observed if they occur in a window extending from x = 1 cm to 
x = 20cm. N decays are observed at locations {x\, . . . ,xn}- What 
is A? 

• • •• I 

X ' 

I had scratched my head over this for some time. My education had provided 
me with a couple of approaches to solving such inference problems: construct- 
ing 'estimators' of the unknown parameters; or 'fitting' the model to the data, 
or to a processed version of the data. 

Since the mean of an unconstrained exponential distribution is A, it seemed 
reasonable to examine the sample mean x = J2 n X n/N and see if an estimator 
A could be obtained from it. It was evident that the estimator A = x— 1 would 
be appropriate for A <C 20 cm, but not for cases where the truncation of the 
distribution at the right-hand side is significant; with a little ingenuity and 
the introduction of ad hoc bins, promising estimators for A 3> 20 cm could be 
constructed. But there was no obvious estimator that would work under all 
conditions. 

Nor could I find a satisfactory approach based on fitting the density P(x | A) 
to a histogram derived from the data. I was stuck. 

What is the general solution to this problem and others like it? Is it 
always necessary, when confronted by a new inference problem, to grope in the 
dark for appropriate 'estimators' and worry about finding the 'best' estimator 
(whatever that means)? 
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8 10 12 14 16 18 20 X 



X 

Steve wrote down the probability of one data point, given A: 

*/ x /Z(X) 



P(x\X) 



1 < x < 20 
otherwise 



where 



Z{X) = J dx\ e~ x l x = (e 



-l/A 



-20/ A 



This seemed obvious enough. Then he wrote Bayes' theorem: 



P(X | {X!,. . .,x N }) 



P({x}\X)P(X) 



P({x}) 
1 



(XZ(X)) 



N 



exp (- Ef x n /\) P(X). 



(3.1) 
(3.2) 

(3.3) 
(3.4) 



Suddenly, the straightforward distribution P({xi, . . . , xjv} | A), defining the 
probability of the data given the hypothesis A, was being turned on its head 
so as to define the probability of a hypothesis given the data. A simple figure 
showed the probability of a single data point P(x | A) as a familiar function of x, 
for different values of A (figure 3.1). Each curve was an innocent exponential, 
normalized to have area 1. Plotting the same function as a function of A for a 
fixed value of x, something remarkable happens: a peak emerges (figure 3.2). 
To help understand these two points of view of the one function, figure 3.3 
shows a surface plot of P(x | A) as a function of x and A. 

For a dataset consisting of several points, e.g., the six points {x}^ =1 = 
{1.5,2,3,4,5, 12}, the likelihood function P{{x] \ X) is the product of the N 
functions of A, P(x n | A) (figure 3.4). 
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Figure 3.1. The probability 
density P(x | A) as a function of x. 



Figure 3.2. The probability 
density P(x \ X) as a function of A, 
for three different values of x. 
When plotted this way round, the 
function is known as the likelihood 
of A. The marks indicate the 
three values of A, A = 2, 5, 10, that 
were used in the preceding figure. 




Figure 3.3. The probability 
density P(x | A) as a function of x 
and A. Figures 3.1 and 3.2 are 
vertical sections through this 
surface. 



1 .4e-06 




100 



Figure 3.4. The likelihood function 
in the case of a six-point dataset, 
P(hrj = {1.5,2,3,4,5,12}|A),as 
a function of A. 
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Steve summarized Bayes' theorem as embodying the fact that 



what you know about A after the data arrive is what you knew 
before [P(A)], and what the data told you [P({x} | A)]. 



Probabilities are used here to quantify degrees of belief. To nip possible 
confusion in the bud, it must be emphasized that the hypothesis A that cor- 
rectly describes the situation is not a stochastic variable, and the fact that the 
Bayesian uses a probability distribution P does not mean that he thinks of 
the world as stochastically changing its nature between the states described 
by the different hypotheses. He uses the notation of probabilities to represent 
his beliefs about the mutually exclusive micro- hypotheses (here, values of A), 
of which only one is actually true. That probabilities can denote degrees of 
belief, given assumptions, seemed reasonable to me. 

The posterior probability distribution (3.4) represents the unique and com- 
plete solution to the problem. There is no need to invent 'estimators'; nor do 
we need to invent criteria for comparing alternative estimators with each other. 
Whereas orthodox statisticians offer twenty ways of solving a problem, and an- 
other twenty different criteria for deciding which of these solutions is the best, 
Bayesian statistics only offers one answer to a well-posed problem. 

Assumptions in inference 

Our inference is conditional on our assumptions [for example, the prior P(X)]. 
Critics view such priors as a difficulty because they are 'subjective', but I don't 
see how it could be otherwise. How can one perform inference without making 
assumptions? I believe that it is of great value that Bayesian methods force 
one to make these tacit assumptions explicit. 

First, once assumptions are made, the inferences are objective and unique, 
reproducible with complete agreement by anyone who has the same informa- 
tion and makes the same assumptions. For example, given the assumptions 
listed above, H, and the data D, everyone will agree about the posterior prob- 
ability of the decay length A: 



P(X\D,H) 



P{D\\,H)P(\\H) 
P{D\H) 



(3.5) 



Second, when the assumptions are explicit, they are easier to criticize, and 
easier to modify - indeed, we can quantify the sensitivity of our inferences to 
the details of the assumptions. For example, we can note from the likelihood 
curves in figure 3.2 that in the case of a single data point at x = 5, the 
likelihood function is less strongly peaked than in the case x = 3; the details 
of the prior P(A) become increasingly important as the sample mean x gets 
closer to the middle of the window, 10.5. In the case x = 12, the likelihood 
function doesn't have a peak at all - such data merely rule out small values 
of A, and don't give any information about the relative probabilities of large 
values of A. So in this case, the details of the prior at the small- A end of things 
are not important, but at the large-A end, the prior is important. 

Third, when we are not sure which of various alternative assumptions is 
the most appropriate for a problem, we can treat this question as another 
inference task. Thus, given data D, we can compare alternative assumptions 
Tt using Bayes' theorem: 



If you have any difficulty 
understanding this chapter I 
recommend ensuring you arc 
happy with exercises 3.1 and 3.2 
(p. 47) then noting their similarity 
to exercise 3.3. 



(3.6) 
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3.2: The bent coin 

where / denotes the highest assumptions, which we are not questioning. 

Fourth, we can take into account our uncertainty regarding such assump- 
tions when we make subsequent predictions. Rather than choosing one partic- 
ular assumption TL* , and working out our predictions about some quantity t, 
P(t | D, H*, I), we obtain predictions that take into account our uncertainty 
about 7i by using the sum rule: 

P(t | D,I)=^2 P(t | D, H, I)P(H \D,I). (3.7) 
n 

This is another contrast with orthodox statistics, in which it is conventional 
to 'test' a default model, and then, if the test 'accepts the model' at some 
'significance level', to use exclusively that model to make predictions. 
Steve thus persuaded me that 



probability theory reaches parts that ad hoc methods cannot reach. 



Let's look at a few more examples of simple inference problems. 

► 3.2 The bent coin 

A bent coin is tossed F times; we observe a sequence s of heads and tails 
(which we'll denote by the symbols a and b). We wish to know the bias of 
the coin, and predict the probability that the next toss will result in a head. 
We first encountered this task in example 2.7 (p. 30), and we will encounter it 
again in Chapter 6, when we discuss adaptive data compression. It is also the 
original inference problem studied by Thomas Bayes in his essay published in 
1763. 

As in exercise 2.8 (p. 30), we will assume a uniform prior distribution and 
obtain a posterior distribution by multiplying by the likelihood. A critic might 
object, 'where did this prior come from?' I will not claim that the uniform 
prior is in any way fundamental; indeed we'll give examples of nonuniform 
priors later. The prior is a subjective assumption. One of the themes of this 
book is: 



you can't do inference - or data compression - without making 
assumptions. 



We give the name TL\ to our assumptions. [We'll be introducing an al- 
ternative set of assumptions in a moment.] The probability, given p a , that F 
tosses result in a sequence s that contains {F a , F b } counts of the two outcomes 
is 

P(s|p a ,F,W 1 )=pf(l-p/ b . (3.8) 

[For example, P(s = aaba | p a , F = 4, Hi) = p a p a (l — Pa.)Pa.-} Our first model 
assumes a uniform prior distribution for p a , 

P(p a |Wl)=l, Pa e [0,1] (3.9) 

and pb = 1 - p a . 

Inferring unknown parameters 

Given a string of length F of which F a are as and F b are bs, we are interested 
in (a) inferring what p a might be; (b) predicting whether the next character is 
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an a or a b. [Predictions are always expressed as probabilities. So 'predicting 
whether the next character is an a' is the same as computing the probability 
that the next character is an a.] 

Assuming Hi to be true, the posterior probability of p a , given a string s 
of length F that has counts {F a ,F b }, is, by Bayes' theorem, 

P( Pa \s,F,Hi) - p^fyRT) ' (3 - 10) 

The factor P(s \p a ,F,Hi), which, as a function of p a , is known as the likeli- 
hood function, was given in equation (3.8); the prior P(p a \ Hi) was given in 
equation (3.9). Our inference of p a is thus: 

= t&w- (3 ' u) 

The normalizing constant is given by the beta integral 

FJFJ. 



P( S \FH 1 ) = ['dp 
( 1 ' 'Wo P * Pa [ Pa) " T(F a + F h + 2) 



T(F a + F b + 2) (F a + F b + iy: 

(3.12) 

^Exercise 3.5. [2 ' p ' 59] Sketch the posterior probability P(p a | s= aba, F = 3). 

What is the most probable value of p a (i.e., the value that maximizes 
the posterior probability density)? What is the mean value of p a under 
this distribution? 

Answer the same questions for the posterior probability 
P(p a |s = bbb,F = 3). 



From inferences to predictions 

Our prediction about the next toss, the probability that the next toss is an a, 
is obtained by integrating over p a . This has the effect of taking into account 
our uncertainty about p a when making predictions. By the sum rule, 

P(z\s,F) = J Ap a P{B.\p a )P{p a \s,F). (3.13) 

The probability of an a given p a is simply p a , so 

P^ 1 -^ (3.14) 

(3.15) 



P(a|s,F)= / dp a p 



I 



dp a 



P(s\F) 

P(s\F) 
+ l)!F b ! 



L(F a + i^ b + 2) 
which is known as Laplace's rule. 



F a !F b ! 



1)! 



► 3.3 The bent coin and model comparison 

Imagine that a scientist introduces another theory for our data. He asserts 
that the source is not really a bent coin but is really a perfectly formed die with 
one face painted heads ('a') and the other five painted tails ('b'). Thus the 
parameter p a , which in the original model, Hi, could take any value between 
and 1, is according to the new hypothesis, Ho, not a free parameter at all; 
rather, it is equal to 1/6. [This hypothesis is termed Ho so that the suffix of 
each model indicates its number of free parameters.] 

How can we compare these two models in the light of data? We wish to 
infer how probable Hi is relative to Hq. 
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Model comparison as inference 

In order to perform model comparison, we write down Bayes' theorem again, 
but this time with a different argument on the left-hand side. We wish to 
know how probable Hi is given the data. By Bayes' theorem, 

Similarly, the posterior probability of Ho is 

*k%™ (3.18. 

The normalizing constant in both cases is P(s | F), which is the total proba- 
bility of getting the observed data. If Hi and Ho are the only models under 
consideration, this probability is given by the sum rule: 

P(s\F)=P(s\F,H 1 )P(Hi)+P(s\F,H )P(H ). (3.19) 

To evaluate the posterior probabilities of the hypotheses we need to assign 
values to the prior probabilities P(Hi) and P(Ho); in this case, we might 
set these to 1/2 each. And we need to evaluate the data-dependent terms 
P(s\ F,Hi) and P(s \ F, Ho)- We can give names to these quantities. The 
quantity P(s \ F, Hi) is a measure of how much the data favour Hi, and we 
call it the evidence for model Hi- We already encountered this quantity in 
equation (3.10) where it appeared as the normalizing constant of the first 
inference we made - the inference of p a given the data. 



How model comparison works: The evidence for a model is 
usually the normalizing constant of an earlier Bayesian inference. 



We evaluated the normalizing constant for model Hi in (3.12). The evi- 
dence for model Ho is very simple because this model has no parameters to 
infer. Defining po to be 1 /6, we have 

P(s\F,Ho)=P^(l- P o) Fh - (3.20) 



Thus the posterior probability ratio of model Hi to model Ho is 
P(Hi\s,F) _ P( s \F,Hi)P(Hi) 



(3.21) 



P(H \s,F) P( S \F,Ho)P(H ) 

' P ^(l- Po f\ (3.22) 



(F a + F b + 1)!, 

Some values of this posterior probability ratio are illustrated in table 3.5. The 
first five lines illustrate that some outcomes favour one model, and some favour 
the other. No outcome is completely incompatible with either model. With 
small amounts of data (six tosses, say) it is typically not the case that one of 
the two models is overwhelmingly more probable than the other. But with 
more data, the evidence against Ho given by any data set with the ratio F a : F b 
differing from 1:5 mounts up. You can't predict in advance how much data 
are needed to be pretty sure which theory is true. It depends what po is. 

The simpler model, Ho, since it has no adjustable parameters, is able to 
lose out by the biggest margin. The odds may be hundreds to one against it. 
The more complex model can never lose out by a large margin; there's no data 
set that is actually unlikely given model Hi- 
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F 


Data {F a , F h ) 


P(Wi|s,F) 

p(n \s,F) 




6 


(5,1) 


222.2 




6 


(3,3) 


2.67 




6 


(2,4) 


0.71 


= 1/1.4 


6 


(1,5) 


0.356 


= 1/2.8 


6 


(0,6) 


0.427 


= 1/2.3 


20 
20 
20 


(10,10) 
(3,17) 
(0,20) 


96.5 
0.2 
1.83 


= 1/5 



Table 3.5. Outcome of model 
comparison between models Hi 
and Ho for the 'bent coin'. Model 
Ho states that p a = 1/6, p b = 5/6. 



Ho is true 



Hi is true 



Pa = 1/6 

1000/1 
100/1 
10/1 

1/1 

"1/10 
1/100 
50 100 150 200 



1000/1 
100/1 
10/1 
1/1 
— 1/10 
1/100 
50 100 150 200 

1000/1 
100/1 
10/1 
1/1 
-1/10 
1/100 
50 100 150 200 



p a = 0.25 

1000/1 g 

,100/1 4 

10/1 2 

1/1 

1/10 "2 

1/100 " 4 
50 100 150 200 ( 

1000/1 ! 

100/1 4 

10/1 2 

1/1 

1/10 "2 

1/100 - 4 
200 ( 

1000/1 g 

100/1 4 

10/1 2 

1/1 

1/10 "2 

1/100 " 4 
50 100 150 200 ( 



50 100 150 




Pa = 0.5 

1000/1 
100/1 
10/1 

1/1 

1/10 
1/100 
50 100 150 200 

Ti 1000/1 
100/1 
10/1 

1/1 

1/10 
1/100 
50 100 150 200 

I: 1000/1 
100/1 
10/1 

1/1 

1/10 
1/100 
50 100 150 200 



Figure 3.6. Typical behaviour of 
the evidence in favour of Hi as 
bent coin tosses accumulate under 
three different conditions. 
Horizontal axis is the number of 
tosses, F. The vertical axis on the 
left is In t^ttttA- the right-hand 

P(s | F,H U ) ° 

vertical axis shows the values of 

P(s\F,H 1 ) 
P(s\F,H )' 

(See also figure 3.8, p. 60.) 



> Exercise 3.6.^ Show that after F tosses have taken place, the biggest value 
that the log evidence ratio 



log 



P(b\F,Hi) 
P(s\F,H ) 



(3.23) 



can have scales linearly with F if Hi is more probable, but the log 
evidence in favour of Ho can grow at most as log F. 

t> Exercise 3.7 } 3 ' p ' 60 l Putting your sampling theory hat on, assuming F& has 
not yet been measured, compute a plausible range that the log evidence 
ratio might lie in, as a function of F and the true value of p a , and sketch 
it as a function of F for p& = Po = 1/6, p a = 0.25, and p a = 1/2. [Hint: 
sketch the log evidence as a function of the random variable F a and work 
out the mean and standard deviation of F & .] 



Typical behaviour of the evidence 

Figure 3.6 shows the log evidence ratio as a function of the number of tosses, 
F, in a number of simulated experiments. In the left-hand experiments, Tio 
was true. In the right-hand ones, Hi was true, and the value of p a was either 
0.25 or 0.5. 

We will discuss model comparison more in a later chapter. 
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3.4: An example of legal evidence 



► 3.4 An example of legal evidence 

The following example illustrates that there is more to Bayesian inference than 
the priors. 



Two people have left traces of their own blood at the scene of a 
crime. A suspect, Oliver, is tested and found to have type 'O' 
blood. The blood groups of the two traces are found to be of type 
'O' (a common type in the local population, having frequency 60%) 
and of type 'AB' (a rare type, with frequency 1%). Do these data 
(type 'O' and 'AB' blood were found at scene) give evidence in 
favour of the proposition that Oliver was one of the two people 
present at the crime? 



A careless lawyer might claim that the fact that the suspect's blood type was 
found at the scene is positive evidence for the theory that he was present. But 
this is not so. 

Denote the proposition 'the suspect and one unknown person were present' 
by S. The alternative, S, states 'two unknown people from the population were 
present'. The prior in this problem is the prior probability ratio between the 
propositions S and S. This quantity is important to the final verdict and 
would be based on all other available information in the case. Our task here is 
just to evaluate the contribution made by the data D, that is, the likelihood 
ratio, P{D \ S, TL) /P{D \S,TL). In my view, a jury's task should generally be to 
multiply together carefully evaluated likelihood ratios from each independent 
piece of admissible evidence with an equally carefully reasoned prior proba- 
bility. [This view is shared by many statisticians but learned British appeal 
judges recently disagreed and actually overturned the verdict of a trial because 
the jurors had been taught to use Bayes' theorem to handle complicated DNA 
evidence.] 

The probability of the data given S is the probability that one unknown 
person drawn from the population has blood type AB: 



P(D\S,H)=pab (3.24) 

(since given S, we already know that one trace will be of type O). The prob- 
ability of the data given S is the probability that two unknown people drawn 
from the population have types O and AB: 

P(D\S,H) = 2 PoP AB- (3.25) 



In these equations H denotes the assumptions that two people were present 
and left blood there, and that the probability distribution of the blood groups 
of unknown people in an explanation is the same as the population frequencies. 
Dividing, we obtain the likelihood ratio: 

P(D\S,H) ~ 2^ " 2706 " °' 83 - (3 - 26) 

Thus the data in fact provide weak evidence against the supposition that 
Oliver was present. 

This result may be found surprising, so let us examine it from various 
points of view. First consider the case of another suspect, Alberto, who has 
type AB. Intuitively, the data do provide evidence in favour of the theory S' 
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that this suspect was present, relative to the null hypothesis S. And indeed 
the likelihood ratio in this case is: 

P{D\S',H) 1 

P{D\S,H) -2 PAB - 5 °- ^ 

Now let us change the situation slightly; imagine that 99% of people are of 
blood type O, and the rest are of type AB. Only these two blood types exist 
in the population. The data at the scene are the same as before. Consider 
again how these data influence our beliefs about Oliver, a suspect of type 
O, and Alberto, a suspect of type AB. Intuitively, we still believe that the 
presence of the rare AB blood provides positive evidence that Alberto was 
there. But does the fact that type O blood was detected at the scene favour 
the hypothesis that Oliver was present? If this were the case, that would mean 
that regardless of who the suspect is, the data make it more probable they were 
present; everyone in the population would be under greater suspicion, which 
would be absurd. The data may be compatible with any suspect of either 
blood type being present, but if they provide evidence for some theories, they 
must also provide evidence against other theories. 

Here is another way of thinking about this: imagine that instead of two 
people's blood stains there are ten, and that in the entire local population 
of one hundred, there are ninety type O suspects and ten type AB suspects. 
Consider a particular type O suspect, Oliver: without any other information, 
and before the blood test results come in, there is a one in 10 chance that he 
was at the scene, since we know that 10 out of the 100 suspects were present. 
We now get the results of blood tests, and find that nine of the ten stains are 
of type AB, and one of the stains is of type O. Does this make it more likely 
that Oliver was there? No, there is now only a one in ninety chance that he 
was there, since we know that only one person present was of type O. 

Maybe the intuition is aided finally by writing down the formulae for the 
general case where no blood stains of individuals of type O are found, and 
nAB of type AB, a total of N individuals in all, and unknown people come 
from a large population with fractions po,PAB- (There may be other blood 
types too.) The task is to evaluate the likelihood ratio for the two hypotheses: 
S, 'the type O suspect (Oliver) and N— 1 unknown others left N stains'; and 
S, L N unknowns left N stains'. The probability of the data under hypothesis 
S is just the probability of getting no , ^AB individuals of the two types when 
N individuals are drawn at random from the population: 

TV"' 

P(n Q , n AB | S) = , ' , Po>ab B - (3-28) 

In the case of hypothesis S, we need the distribution of the N — l other indi- 
viduals: 



^■"AB|5)= ( ^- 1 ^ | pg>- 1 ^- (3-29) 
The likelihood ratio is: 

P(no,riAB | S) no/N 



P(n ,n AB \S) po 



(3.30) 



This is an instructive result. The likelihood ratio, i.e. the contribution of 
these data to the question of whether Oliver was present, depends simply on 
a comparison of the frequency of his blood type in the observed data with the 
background frequency in the population. There is no dependence on the counts 
of the other types found at the scene, or their frequencies in the population. 
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If there are more type O stains than the average number expected under 
hypothesis S, then the data give evidence in favour of the presence of Oliver. 
Conversely if there are fewer type O stains than the expected number under 
S, then the data reduce the probability of the hypothesis that he was there. 
In the special case uq/N = po, the data contribute no evidence either way, 
regardless of the fact that the data are compatible with the hypothesis S. 

► 3.5 Exercises 

Exercise 3.8. p ' 60 -' The three doors, normal rules. 

On a game show, a contestant is told the rules as follows: 

There are three doors, labelled 1, 2, 3. A single prize has 
been hidden behind one of them. You get to select one door. 
Initially your chosen door will not be opened. Instead, the 
gameshow host will open one of the other two doors, and he 
will do so in such a way as not to reveal the prize. For example, 
if you first choose door 1, he will then open one of doors 2 and 
3, and it is guaranteed that he will choose which one to open 
so that the prize will not be revealed. 

At this point, you will be given a fresh choice of door: you 
can either stick with your first choice, or you can switch to the 
other closed door. All the doors will then be opened and you 
will receive whatever is behind your final choice of door. 

Imagine that the contestant chooses door 1 first; then the gameshow host 
opens door 3, revealing nothing behind the door, as promised. Should 
the contestant (a) stick with door 1, or (b) switch to door 2, or (c) does 
it make no difference? 

4t| Exercise 3.9. [2 ' >'-' il - The three doors, earthquake scenario. 

Imagine that the game happens again and just as the gameshow host is 
about to open one of the doors a violent earthquake rattles the building 
and one of the three doors flies open. It happens to be door 3, and it 
happens not to have the prize behind it. The contestant had initially 
chosen door 1. 

Repositioning his toupee, the host suggests, 'OK, since you chose door 
1 initially, door 3 is a valid door for me to open, according to the rules 
of the game; I'll let door 3 stay open. Let's carry on as if nothing 
happened.' 

Should the contestant stick with door 1, or switch to door 2, or does it 
make no difference? Assume that the prize was placed randomly, that 
the gameshow host does not know where it is, and that the door flew 
open because its latch was broken by the earthquake. 

[A similar alternative scenario is a gameshow whose confused host for- 
gets the rules, and where the prize is, and opens one of the unchosen 
doors at random. He opens door 3, and the prize is not revealed. Should 
the contestant choose what's behind door 1 or door 2? Does the opti- 
mal decision for the contestant depend on the contestant's beliefs about 
whether the gameshow host is confused or not?] 

> Exercise 3.10.'- 2 -' Another example in which the emphasis is not on priors. You 

visit a family whose three children are all at the local school. You don't 
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know anything about the sexes of the children. While walking clum- 
sily round the home, you stumble through one of the three unlabelled 
bedroom doors that you know belong, one each, to the three children, 
and find that the bedroom contains girlie stuff in sufficient quantities to 
convince you that the child who lives in that bedroom is a girl. Later, 
you sneak a look at a letter addressed to the parents, which reads 'From 
the Headmaster: we are sending this letter to all parents who have male 
children at the school to inform them about the following boyish mat- 
ters. . . '. 

These two sources of evidence establish that at least one of the three 
children is a girl, and that at least one of the children is a boy. What 
are the probabilities that there are (a) two girls and one boy; (b) two 
boys and one girl? 

> Exercise 3. II.'- 2 ' p ' 61 -' Mrs S is found stabbed in her family garden. Mr S 

behaves strangely after her death and is considered as a suspect. On 
investigation of police and social records it is found that Mr S had beaten 
up his wife on at least nine previous occasions. The prosecution advances 
this data as evidence in favour of the hypothesis that Mr S is guilty of the 
murder. 'Ah no,' says Mr S's highly paid lawyer, 'statistically, only one 
in a thousand wife-beaters actually goes on to murder his wife. 1 So the 
wife-beating is not strong evidence at all. In fact, given the wife-beating 
evidence alone, it's extremely unlikely that he would be the murderer of 
his wife - only a 1/1000 chance. You should therefore find him innocent.' 

Is the lawyer right to imply that the history of wife-beating does not 
point to Mr S's being the murderer? Or is the lawyer a slimy trickster? 
If the latter, what is wrong with his argument? 

[Having received an indignant letter from a lawyer about the preceding 
paragraph, I'd like to add an extra inference exercise at this point: Does 
my suggestion that Mr. S. 's lawyer may have been a slimy trickster imply 
that I believe all lawyers are slimy tricksters? (Answer: No.)] 

> Exercise 3.12.'- 2 -' A bag contains one counter, known to be either white or 

black. A white counter is put in, the bag is shaken, and a counter 
is drawn out, which proves to be white. What is now the chance of 
drawing a white counter? [Notice that the state of the bag, after the 
operations, is exactly identical to its state before.] 

> Exercise 3.13. [2 ' p ' 62] You move into a new house; the phone is connected, and 

you're pretty sure that the phone number is 740511, but not as sure as 
you would like to be. As an experiment, you pick up the phone and 
dial 740511; you obtain a 'busy' signal. Are you now more sure of your 
phone number? If so, how much? 

t> bxercise 3.14. [1] In a game, two coins are tossed. If either of the coins comes 
up heads, you have won a prize. To claim the prize, you must point to 
one of your coins that is a head and say 'look, that coin's a head, I've 
won'. You watch Fred play the game. He tosses the two coins, and he 

1 In the U.S.A., it is estimated that 2 million women are abused each year by their partners. 
In 1994, 4739 women were victims of homicide; of those, 1326 women (28%) were slain by 
husbands and boyfriends. 

(Sources: http : // www . umn . edu/mincava/ papers/ factoid . htm , 
http: //www. gunf ree . inter .net /vpc/womenfs .htm) 
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points to a coin and says 'look, that coin's a head, I've won'. What is 
the probability that the other coin is a head? 

> Exercise 3.15.^' p ' 63 ^ A statistical statement appeared in The Guardian on 
Friday January 4, 2002: 

When spun on edge 250 times, a Belgian one-euro coin came 
up heads 140 times and tails 110. 'It looks very suspicious 
to me', said Barry Blight, a statistics lecturer at the London 
School of Economics. 'If the coin were unbiased the chance of 
getting a result as extreme as that would be less than 7%'. 

But do these data give evidence that the coin is biased rather than fair? 
[Hint: see equation (3.22).] 

► 3.6 Solutions 

Solution to exercise 3.1 (p. 47). Let the data be D. Assuming equal prior 
probabilities, 

P(A\D) _ 1313121 _ 9 

P{B\D) 2 2 1 2 2 2 2 32 [ ' ' 

and P(A\D) = 9/41. 

Solution to exercise 3.2 (p. 47). The probability of the data given each hy- 
pothesis is: 

,,,,3121311 18 , , 

P(D\A) = = ^7; 3.32 

v 1 ; 20 20 20 20 20 20 20 20 7 y ' 

2222212 64 
P(D \B) = — — — — — — — = —-■ (3.33) 

v 1 ; 20 20 20 20 20 20 20 20 7 ' y ' 

P(D I C) = — — — — — — — = _L. (3.34) 

v 1 ; 20 20 20 20 20 20 20 20 7 v ; 

So 

^l^ = i8T^TT = I ; P ^ D) = t^ p V\°) = k 

(3.35) 




(a) 0.2 0.4 0.6 0.E 

P(p a I s = aba, F = Z) ocp^(l - 




(b) 0.2 0.4 0.6 0.8 1 

P(p a | s = bbb, F = 3) (X (1 - p a ) 3 



Figure 3.7. Posterior probability 
for the bias p a of a bent coin 
given two different data sets. 



Solution to exercise 3.5 (p. 52). 

(a) P(p a I s= aba, F = 3) oc p^(l — p a ). The most probable value of p a (i.e., 
the value that maximizes the posterior probability density) is 2/3. The 
mean value of p a is 3/5. 

See figure 3.7a. 
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(b) P(p a | s = bbb,i r = 3) oc (1 — p a ) 3 . The most probable value of p a (i.e., 
the value that maximizes the posterior probability density) is 0. The 
mean value of p a is 1/5. 

See figure 3.7b. 
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Solution to exercise 3.7 (p. 54). The curves in figure 3.8 were found by finding 
the mean and standard deviation of F a , then setting F a to the mean ± two 
standard deviations to get a 95% plausible range for F a , and computing the 
three corresponding values of the log evidence ratio. 

Solution to exercise 3.8 (p. 57). Let Hi denote the hypothesis that the prize is 
behind door i. We make the following assumptions: the three hypotheses Hi, 
H2 and H 3 are equiprobable a priori, i.e., 



Figure 3.8. Range of plausible 
values of the log evidence in 
favour of Hi as a function of F. 
The vertical axis on the left is 

i°gg!^; the right-hand 

vertical axis shows the values of 

P(s\F,H 1 ) 
P(s\F,Ho)' 

The solid line shows the log 
evidence if the random variable 
F a takes on its mean value, 
Fa = p a F. The dotted lines show 
(approximately) the log evidence 
if F a is at its 2.5th or 97.5th 
percentile. 

(See also figure 3.6, p. 54.) 



P(Hi) = P{H 2 ) = P(H 3 ) = 



(3.36) 



The datum we receive, after choosing door 1, is one of D = 3 and D = 2 (mean- 
ing door 3 or 2 is opened, respectively). We assume that these two possible 
outcomes have the following probabilities. If the prize is behind door 1 then 
the host has a free choice; in this case we assume that the host selects at 
random between D = 2 and D = 3. Otherwise the choice of the host is forced 
and the probabilities are and 1. 



P(D-- 
P(D-- 



= 2|Wi) = 
= 3|Wi) = 



Ml 
Ml 



P(D-- 
P(D-- 



= 2|W 2 ) = 
= 3|W 2 ) = 



P(D-- 
P(D-- 



= 2|W 3 ) = 

= 3|W 3 ) = 



(3.37) 



Now, using Bayes' theorem, we evaluate the posterior probabilities of the 
hypotheses: 

P(n i \D = 3)= P ( D = T i lT i) (3-38) 



I P(Hi |D = 3) = ™ I P(H 2 \D = 3) = a™ I P(H 3 \D = 3) = 

(3.39) 

The denominator P(D = 3) is (1/2) because it is the normalizing constant for 
this posterior distribution. So 

\P(Hi\D = 3) = V3 I P(H 2 \D = 3) = % I P(H 3 \D = 3) = 0. | 

(3.40) 

So the contestant should switch to door 2 in order to have the biggest chance 
of getting the prize. 

Many people find this outcome surprising. There are two ways to make it 
more intuitive. One is to play the game thirty times with a friend and keep 
track of the frequency with which switching gets the prize. Alternatively, you 
can perform a thought experiment in which the game is played with a million 
doors. The rules are now that the contestant chooses one door, then the game 



P(D = 3) 



(0)(l/3) 
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show host opens 999,998 doors in such a way as not to reveal the prize, leaving 
the contestant's selected door and one other door closed. The contestant may 
now stick or switch. Imagine the contestant confronted by a million doors, 
of which doors 1 and 234,598 have not been opened, door 1 having been the 
contestant's initial guess. Where do you think the prize is? 



Solution to exercise 3.9 (p.57). If door 3 is opened by an earthquake, the 
inference comes out differently - even though visually the scene looks the 
same. The nature of the data, and the probability of the data, are both 
now different. The possible data outcomes are, firstly, that any number of 
the doors might have opened. We could label the eight possible outcomes 
d = (0,0,0), (0,0,1), (0,1,0), (1,0,0), (0,1,1),... ,(1,1,1). Secondly, it might 
be that the prize is visible after the earthquake has opened one or more doors. 
So the data D consists of the value of d, and a statement of whether the prize 
was revealed. It is hard to say what the probabilities of these outcomes are, 
since they depend on our beliefs about the reliability of the door latches and 
the properties of earthquakes, but it is possible to extract the desired posterior 
probability without naming the values of P(d | Hi) for each d. All that matters 
are the relative values of the quantities P(D \ Hi), P(D \ H2), P(D \ H3), for 
the value of D that actually occurred. [This is the likelihood principle, which 
we met in section 2.3.] The value of D that actually occurred is 'd= (0,0, 1), 
and no prize visible'. First, it is clear that P(D \ H3) = 0, since the datum 
that no prize is visible is incompatible with Hs- Now, assuming that the 
contestant selected door 1, how does the probability P{D \ Hi) compare with 
P(D I W2)? Assuming that earthquakes are not sensitive to decisions of game 
show contestants, these two quantities have to be equal, by symmetry. We 
don't know how likely it is that door 3 falls off its hinges, but however likely 
it is, it's just as likely to do so whether the prize is behind door 1 or door 2. 
So, if P(D I Hi) and P(D \ H2) are equal, we obtain: 



P{HAD) = 

= 1/2 



p(n 2 \D) = 

= 1/2 



p(h 3 \d) 



_ P(D\H 3 )(i/3 
P(D) 

= 0. 

(3.41) 

The two possible hypotheses are now equally likely. 

If we assume that the host knows where the prize is and might be acting 
deceptively, then the answer might be further modified, because we have to 
view the host's words as part of the data. 

Confused? It's well worth making sure you understand these two gameshow 
problems. Don't worry, I slipped up on the second problem, the first time I 
met it. 

There is a general rule which helps immensely when you have a confusing 
probability problem: 



Always write down the probability of everything. 
(Steve Gull) 
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From this joint probability, any desired inference can be mechanically ob- 
tained (figure 3.9). 



Figure 3.9. The probability of 
everything, for the second 
three-door problem, assuming an 
earthquake has just occurred. 
Here, P3 is the probability that 
door 3 alone is opened by an 
earthquake. 



Solution to exercise 3.11 (p. 58). The statistic quoted by the lawyer indicates 
the probability that a randomly selected wife-beater will also murder his wife. 
The probability that the husband was the murderer, given that the wife has 
been murdered, is a completely different quantity. 
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To deduce the latter, we need to make further assumptions about the 
probability that the wife is murdered by someone else. If she lives in a neigh- 
bourhood with frequent random murders, then this probability is large and 
the posterior probability that the husband did it (in the absence of other ev- 
idence) may not be very large. But in more peaceful regions, it may well be 
that the most likely person to have murdered you, if you are found murdered, 
is one of your closest relatives. 

Let's work out some illustrative numbers with the help of the statistics 
on page 58. Let m= 1 denote the proposition that a woman has been mur- 
dered; h=l, the proposition that the husband did it; and 6=1, the propo- 
sition that he beat her in the year preceding the murder. The statement 
'someone else did it' is denoted by h = 0. We need to define P(h\m=l), 
P(b\h=l,m=l), and P(6 = l | h = 0, m = l) in order to compute the pos- 
terior probability P(h = 1 | 6= 1, m= 1). From the statistics, we can read 
out P(h=l\m=l) = 0.28. And if two million women out of 100 million 
are beaten, then P(b= 1 1 h = 0, m= 1) = 0.02. Finally, we need a value for 
P(b\ h= l,m= 1): if a man murders his wife, how likely is it that this is the 
first time he laid a finger on her? I expect it's pretty unlikely; so maybe 
P(b= 1 1 h = l,m= 1) is 0.9 or larger. 

By Bayes' theorem, then, 

P(h=l\b=l,m=l) = ' 9 X ' 28 — -95%. (3.42) 

v 1 ' ; .9 x .28 + .02 x .72 v ; 

One way to make obvious the sliminess of the lawyer on p. 58 is to construct 
arguments, with the same logical structure as his, that are clearly wrong. 
For example, the lawyer could say 'Not only was Mrs. S murdered, she was 
murdered between 4.02pm and 4.03pm. Statistically, only one in a million 
wife-beaters actually goes on to murder his wife between 4.02pm and 4.03pm. 
So the wife-beating is not strong evidence at all. In fact, given the wife-beating 
evidence alone, it's extremely unlikely that he would murder his wife in this 
way - only a 1/1,000,000 chance.' 

Solution to exercise 3.13 (p. 58). There are two hypotheses. Ho: your number 
is 740511; H\: it is another number. The data, D, are 'when I dialed 740511, 
I got a busy signal'. What is the probability of D, given each hypothesis? If 
your number is 740511, then we expect a busy signal with certainty: 

P(D\H ) = 1. 

On the other hand, if Hi is true, then the probability that the number dialled 
returns a busy signal is smaller than 1, since various other outcomes were also 
possible (a ringing tone, or a number-unobtainable signal, for example). The 
value of this probability P(D \ Hi) will depend on the probability a that a 
random phone number similar to your own phone number would be a valid 
phone number, and on the probability (3 that you get a busy signal when you 
dial a valid phone number. 

I estimate from the size of my phone book that Cambridge has about 
75 000 valid phone numbers, all of length six digits. The probability that a 
random six-digit number is valid is therefore about 75 000/10 6 = 0.075. If 
we exclude numbers beginning with 0, 1, and 9 from the random choice, the 
probability a is about 75 000/700 000 ~ 0.1. If we assume that telephone 
numbers are clustered then a misremembered number might be more likely 
to be valid than a randomly chosen number; so the probability, a, that our 
guessed number would be valid, assuming H\ is true, might be bigger than 
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0.1. Anyway, a must be somewhere between 0.1 and 1. We can carry forward 
this uncertainty in the probability and see how much it matters at the end. 

The probability (3 that you get a busy signal when you dial a valid phone 
number is equal to the fraction of phones you think are in use or off-the-hook 
when you make your tentative call. This fraction varies from town to town 
and with the time of day. In Cambridge, during the day, I would guess that 
about 1% of phones are in use. At 4am, maybe 0.1%, or fewer. 

The probability P(D \ Hi) is the product of a and (3, that is, about 0.1 x 
0.01 = 10~ 3 . According to our estimates, there's about a one-in-a-thousand 
chance of getting a busy signal when you dial a random number; or one-in-a- 
hundred, if valid numbers are strongly clustered; or one-in-10 4 , if you dial in 
the wee hours. 

How do the data affect your beliefs about your phone number? The pos- 
terior probability ratio is the likelihood ratio times the prior probability ratio: 



P(H \D) = P(D\H )P(H ) 
P(Ui\D) P(D\Hi)P(HiY 



(3.43) 



The likelihood ratio is about 100-to-l or 1000-to-l, so the posterior probability 
ratio is swung by a factor of 100 or 1000 in favour of Ho- If the prior probability 
of Ho was 0.5 then the posterior probability is 



P(n 1 d) 



P{Ui\D) 

p(n„ | d) 



0.99 or 0.999. 



(3.44) 



Solution to exercise 3.15 (p. 59). We compare the models Ho - the coin is fair 
- and H\ - the coin is biased, with the prior on its bias set to the uniform 
distribution P(p\Hi) = 1. [The use of a uniform prior seems reasonable 
to me, since I know that some coins, such as American pennies, have severe 
biases when spun on edge; so the situations p = 0.01 or p = 0.1 or p = 0.95 
would not surprise me.] 

When I mention Ho - the coin is fair - a pedant would say, 'how absurd to even 
consider that the coin is fair - any coin is surely biased to some extent'. And 
of course I would agree. So will pedants kindly understand Ho as meaning 'the 
coin is fair to within one part in a thousand, i.e., p € 0.5 ± 0.001'. 



The likelihood ratio is: 



P(D\Hi 



140!110! 
251! 



P{D\Ho) 1/2 250 



0.48. 



(3.45) 



Thus the data give scarcely any evidence either way; in fact they give weak 
evidence (two to one) in favour of Ho^ 

'No, no', objects the believer in bias, 'your silly uniform prior doesn't 
represent my prior beliefs about the bias of biased coins - I was expecting only 
a small bias'. To be as generous as possible to the Hi, let's see how well it 
could fare if the prior were presciently set. Let us allow a prior of the form 
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Figure 3.10. The probability 
distribution of the number of 
heads given the two hypotheses, 
that the coin is fair, and that it is 
biased, with the prior distribution 
of the bias being uniform. The 
outcome (D = 140 heads) gives 
weak evidence in favour of Ho, the 
hypothesis that the coin is fair. 



P(p\Hi,a) 



Z(a) 



p a - l {l-p) 



a-l 



where Z(a) = T{a) 2 /V{2a) (3.46) 



(a Beta distribution, with the original uniform prior reproduced by setting 
a = 1). By tweaking a, the likelihood ratio for Hi over Ho, 



P(D\Hi,a) _ r(140+a)r(110+a)r(2a)2 
P{D\Ho) 



2o() 



r(250+2a)r(a) 1 



(3.47) 
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can be increased a little. It is shown for several values of a in figure 3.11. 
Even the most favourable choice of a (a ~ 50) can yield a likelihood ratio of 
only two to one in favour of Hi. 

In conclusion, the data are not 'very suspicious'. They can be construed 
as giving at most two-to-one evidence in favour of one or other of the two 
hypotheses. 

Are these wimpy likelihood ratios the fault of over-restrictive priors? Is there 
any way of producing a 'very suspicious' conclusion? The prior that is best- 
matched to the data, in terms of likelihood, is the prior that sets p to / = 
140/250 with probability one. Let's call this model H*. The likelihood ratio is 
P(D\H*)/P(D\Ha) = 2 250 / 140 (l - /) 110 = 6.1. So the strongest evidence that 
these data can possibly muster against the hypothesis that there is no bias is 
six-to-onc. 

While we are noticing the absurdly misleading answers that 'sampling the- 
ory' statistics produces, such as the p- value of 7% in the exercise we just solved, 
let's stick the boot in. If we make a tiny change to the data set, increasing 
the number of heads in 250 tosses from 140 to 141, we find that the p- value 
goes below the mystical value of 0.05 (the p- value is 0.0497). The sampling 
theory statistician would happily squeak 'the probability of getting a result as 
extreme as 141 heads is smaller than 0.05 - we thus reject the null hypothesis 
at a significance level of 5%'. The correct answer is shown for several values 
of a in figure 3.12. The values worth highlighting from this table are, first, 
the likelihood ratio when Hi uses the standard uniform prior, which is 1:0.61 
in favour of the null hypothesis TLq. Second, the most favourable choice of a, 
from the point of view of Hi, can only yield a likelihood ratio of about 2.3:1 
in favour of Hi. 

Be warned! A p- value of 0.05 is often interpreted as implying that the odds 
are stacked about twenty-to-one against the null hypothesis. But the truth 
in this case is that the evidence either slightly favours the null hypothesis, or 
disfavours it by at most 2.3 to one, depending on the choice of prior. 

The p- values and 'significance levels' of classical statistics should be treated 
with extreme caution. Shun them! Here ends the sermon. 
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Figure 3.11. Likelihood ratio for 
various choices of the prior- 
distribution's hyperparameter a. 
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Figure 3.12. Likelihood ratio for 
various choices of the prior- 
distribution's hyperparameter a, 
when the data are D' = 141 heads 
in 250 trials. 
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About Chapter 4 



In this chapter we discuss how to measure the information content of the 
outcome of a random experiment. 

This chapter has some tough bits. If you find the mathematical details 
hard, skim through them and keep going - you'll be able to enjoy Chapters 5 
and 6 without this chapter's tools. 

Before reading Chapter 4, you should have read Chapter 2 and worked on 
exercises 2.21-2.25 and 2.16 (pp. 36-37), and exercise 4.1 below. 

The following exercise is intended to help you think about how to measure 
information content. 

Exercise 4.1.^' P- 69 ] - Please work on this problem before reading Chapter 4- 

You are given 12 balls, all equal in weight except for one that is either 
heavier or lighter. You are also given a two-pan balance to use. In each 
use of the balance you may put any number of the 12 balls on the left 
pan, and the same number on the right pan, and push a button to initiate 
the weighing; there are three possible outcomes: either the weights are 
equal, or the balls on the left are heavier, or the balls on the left are 
lighter. Your task is to design a strategy to determine which is the odd 
ball and whether it is heavier or lighter than the others in as few uses 
of the balance as possible. 

While thinking about this problem, you may find it helpful to consider 
the following questions: 

(a) How can one measure information? 

(b) When you have identified the odd ball and whether it is heavy or 
light, how much information have you gained? 

(c) Once you have designed a strategy, draw a tree showing, for each 
of the possible outcomes of a weighing, what weighing you perform 
next. At each node in the tree, how much information have the 
outcomes so far given you, and how much information remains to 
be gained? 

(d) How much information is gained when you learn (i) the state of a 
flipped coin; (ii) the states of two flipped coins; (iii) the outcome 
when a four-sided die is rolled? 

(e) How much information is gained on the first step of the weighing 
problem if 6 balls are weighed against the other 6? How much is 
gained if 4 are weighed against 4 on the first step, leaving out 4 
balls? 



Notation 
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The Source Coding Theorem 



► 4.1 How to measure the information content of a random variable? 

In the next few chapters, we'll be talking about probability distributions and 
random variables. Most of the time we can get by with sloppy notation, 
but occasionally, we will need precise notation. Here is the notation that we 
established in Chapter 2. 

An ensemble X is a triple (x, Ax, "Px)i where the outcome x is the value 
of a random variable, which takes on one of a set of possible values, 
Ax = {ai,a 2 , . . . ,Oi,. . . ,ai}, having probabilities V x = {pi,Pi, ■ ■ ■ ,Pi}, 
with P(x = ai) = pi, Pi>0 and Y, ateAx P{x = a,i) = 1. 

How can we measure the information content of an outcome x = ai from such 
an ensemble? In this chapter we examine the assertions 

1. that the Shannon information content, 

h(x = Qi) = log 2 — , (4.1) 

Pi 

is a sensible measure of the information content of the outcome x = en, 
and 

2. that the entropy of the ensemble, 

#(X) = V Pi log 2 -, (4.2) 

is a sensible measure of the ensemble's average information content. 



Figure 4.1. The Shannon 
information content h(p) = log 2 j } 
and the binary entropy function 
H 2 (p) = H(p,l-p) = 

P lo %2 I + (1 - P) l°g 2 O^i) aS a 

function of p. 



Figure 4.1 shows the Shannon information content of an outcome with prob- 
ability p, as a function of p. The less probable an outcome is, the greater 
its Shannon information content. Figure 4.1 also shows the binary entropy 
function, 

H 2 (p) = H(p, l-p) = plog 2 i + (1 - p) log 2 jy^y (4-3) 

which is the entropy of the ensemble X whose alphabet and probability dis- 
tribution are Ax = {a,b},Vx = {p, (1 — p)}- 




P h(p) H 2 (p) 

0.001 10.0 0.011 

0.01 6.6 0.081 

0.1 3.3 0.47 

0.2 2.3 0.72 

0.5 1.0 1.0 




67 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 

68 4 — The Source Coding Theorem 



Information content of independent random variables 

Why should log l/pi have anything to do with the information content? Why 
not some other function of p{l We'll explore this question in detail shortly, 
but first, notice a nice property of this particular function h(x) = log l/p{x). 

Imagine learning the value of two independent random variables, x and y. 
The definition of independence is that the probability distribution is separable 
into a product: 

P(x,y) = P(x)P(y). (4.4) 

Intuitively, we might want any measure of the 'amount of information gained' 
to have the property of additivity - that is, for independent random variables 
x and y, the information gained when we learn x and y should equal the sum 
of the information gained if x alone were learned and the information gained 
if y alone were learned. 

The Shannon information content of the outcome x, y is 

h(x, y) = log / = log 1 = log — !— + log — j— (4.5) 
P{x,y) P{x)P{y) P(x) P(y) 

so it does indeed satisfy 

h{x,y) = h(x) + h{y), if x and y are independent. (4-6) 

Exercise 4.2.^' p ' 86 ^ Show that, if x and y are independent, the entropy of the 
outcome x, y satisfies 

H(X,Y) = H(X)+H(Y). (4.7) 
In words, entropy is additive for independent variables. 

We now explore these ideas with some examples; then, in section 4.4 and 
in Chapters 5 and 6, we prove that the Shannon information content and the 
entropy are related to the number of bits needed to describe the outcome of 
an experiment. 



The weighing problem: designing informative experiments 

Have you solved the weighing problem (exercise 4.1, p. 66) yet? Are you sure? 
Notice that in three uses of the balance - which reads either 'left heavier', 
'right heavier', or 'balanced' - the number of conceivable outcomes is 3 3 = 27, 
whereas the number of possible states of the world is 24: the odd ball could 
be any of twelve balls, and it could be heavy or light. So in principle, the 
problem might be solvable in three weighings - but not in two, since 3 2 < 24. 

If you know how you can determine the odd weight and whether it is 
heavy or light in three weighings, then you may read on. If you haven't found 
a strategy that always gets there in three weighings, I encourage you to think 
about exercise 4.1 some more. 

Why is your strategy optimal? What is it about your series of weighings 
that allows useful information to be gained as quickly as possible? The answer 
is that at each step of an optimal procedure, the three outcomes ('left heavier', 
'right heavier', and 'balance') are as close as possible to equiprobable. An 
optimal solution is shown in figure 4.2. 

Suboptimal strategies, such as weighing balls 1-6 against 7-12 on the first 
step, do not achieve all outcomes with equal probability: these two sets of balls 
can never balance, so the only possible outcomes are 'left heavy' and 'right 
heavy'. Such a binary outcome rules out only half of the possible hypotheses, 
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Figure 4.2. An optimal solution to the weighing problem. At each step there are two boxes: the left 
box shows which hypotheses are still possible; the right box shows the balls involved in the 
next weighing. The 24 hypotheses are written 1 + ,...,12~, with, e.g., 1 + denoting that 
1 is the odd ball and it is heavy. Weighings are written by listing the names of the balls 
on the two pans, separated by a line; for example, in the first weighing, balls 1, 2, 3, and 
4 are put on the left-hand side and 5, 6, 7, and 8 on the right. In each triplet of arrows 
the upper arrow leads to the situation when the left side is heavier, the middle arrow to 
the situation when the right side is heavier, and the lower arrow to the situation when the 
outcome is balanced. The three points labelled * correspond to impossible outcomes. 
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so a strategy that uses such outcomes must sometimes take longer to find the 
right answer. 

The insight that the outcomes should be as near as possible to equiprobable 
makes it easier to search for an optimal strategy. The first weighing must 
divide the 24 possible hypotheses into three groups of eight. Then the second 
weighing must be chosen so that there is a 3:3:2 split of the hypotheses. 

Thus we might conclude: 



the outcome of a random experiment is guaranteed to be most in- 
formative if the probability distribution over outcomes is uniform. 



This conclusion agrees with the property of the entropy that you proved 
when you solved exercise 2.25 (p. 37): the entropy of an ensemble X is biggest 
if all the outcomes have equal probability p, = l/|.4x|- 

Guessing games 

In the game of twenty questions, one player thinks of an object, and the 
other player attempts to guess what the object is by asking questions that 
have yes/no answers, for example, 'is it alive?', or 'is it human?' The aim 
is to identify the object with as few questions as possible. What is the best 
strategy for playing this game? For simplicity, imagine that we are playing 
the rather dull version of twenty questions called 'sixty-three'. 

Example 4.3. The game 'sixty-three'. What's the smallest number of yes/no 
questions needed to identify an integer x between and 63? 

Intuitively, the best questions successively divide the 64 possibilities into equal 
sized sets. Six questions suffice. One reasonable strategy asks the following 
questions: 



1 


is 


x > 32? 


2 


is 


a; mod 32 > 16? 


3 


is 


a; mod 16 > 8? 


4 


is 


xmod8 > 4? 


5 


is 


xmod4 > 2? 


6 


is 


xmod2 = 1? 



[The notation a; mod 32, pronounced l x modulo 32', denotes the remainder 
when x is divided by 32; for example, 35 mod 32 = 3 and 32 mod 32 = 0.] 

The answers to these questions, if translated from {yes, no} to {1, 0}, give 
the binary expansion of x, for example 35 => 100011. □ 

What are the Shannon information contents of the outcomes in this ex- 
ample? If we assume that all values of x are equally likely, then the answers 
to the questions are independent and each has Shannon information content 
log 2 (l/0.5) = lbit; the total Shannon information gained is always six bits. 
Furthermore, the number x that we learn from these questions is a six-bit bi- 
nary number. Our questioning strategy defines a way of encoding the random 
variable X cLS EL binary file. 

So far, the Shannon information content makes sense: it measures the 
length of a binary file that encodes x. However, we have not yet studied 
ensembles where the outcomes have unequal probabilities. Does the Shannon 
information content make sense there too? 
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X 


X 


X 


X 


X 



move # 


1 


2 


32 


48 


49 


question 


G3 


Bl 


E5 


F3 


H3 


outcome 


x = n 


x = n 


x = n 


x = n 


x = y 


P{x) 


63 


62 


32 


16 


1 


64 


63 


33 


17 


16 


h{x) 


0.0227 


0.0230 


0.0443 


0.0874 


4.0 


Total info. 


0.0227 


0.0458 


1.0 


2.0 


6.0 



The game of submarine: how many bits can one bit convey? 

In the game of battleships, each player hides a fleet of ships in a sea represented 
by a square grid. On each turn, one player attempts to hit the other's ships by 
firing at one square in the opponent's sea. The response to a selected square 
such as 'G3' is either 'miss', 'hit', or 'hit and destroyed'. 

In a boring version of battleships called submarine, each player hides just 
one submarine in one square of an eight-by-eight grid. Figure 4.3 shows a few 
pictures of this game in progress: the circle represents the square that is being 
fired at, and the xs show squares in which the outcome was a miss, x = n; the 
submarine is hit (outcome x = y shown by the symbol s) on the 49th attempt. 

Each shot made by a player defines an ensemble. The two possible out- 
comes are {y,n}, corresponding to a hit and a miss, and their probabili- 
ties depend on the state of the board. At the beginning, P(j) = 1/64 and 
P(n) = 63/64. At the second shot, if the first shot missed, -P(y) = 1/63 and 
P(n) = 62/63. At the third shot, if the first two shots missed, P(y) = 1/62 
and P(n) = 61/62. 

The Shannon information gained from an outcome x is h(x) = log(l/P(x)). 
If we are lucky, and hit the submarine on the first shot, then 



h{x) = /i(i)(y) = log 2 64 = 6 bits. 



(4.8) 



Now, it might seem a little strange that one binary outcome can convey six 
bits. But we have learnt the hiding place, which could have been any of 64 
squares; so we have, by one lucky binary question, indeed learnt six bits. 

What if the first shot misses? The Shannon information that we gain from 
this outcome is 

64 

h{x) = ht An) = log 2 — = 0.0227 bits. (4.9) 
v ' 63 

Does this make sense? It is not so obvious. Let's keep going. If our second 
shot also misses, the Shannon information content of the second outcome is 



h(2) (n) = log 2 



63 
62 



0.0230 bits. 



(4.10) 



If we miss thirty- two times (firing at a new square each time), the total Shan- 
non information gained is 

64 , 63 , 33 

! 63+log 2 - + --- + log 2 - 

0.0227 + 0.0230 + ••• + 0.0430 = 1.0 bits. (4.11) 



loe 



Figure 4.3. A game of submarine. 

The submarine is hit on the 49th 
attempt. 
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Why this round number? Well, what have we learnt? We now know that the 
submarine is not in any of the 32 squares we fired at; learning that fact is just 
like playing a game of sixty-three (p. 70), asking as our first question 'is x 
one of the thirty-two numbers corresponding to these squares I fired at?', and 
receiving the answer 'no'. This answer rules out half of the hypotheses, so it 
gives us one bit. 

After 48 unsuccessful shots, the information gained is 2 bits: the unknown 
location has been narrowed down to one quarter of the original hypothesis 
space. 

What if we hit the submarine on the 49th shot, when there were 16 squares 
left? The Shannon information content of this outcome is 

/j (4g) (y) = log 2 16 = 4.0 bits. (4.12) 

The total Shannon information content of all the outcomes is 

64 , 63 17 , 16 

log 2 - + log 2 - + --- + log 2 - + log 2T 

= 0.0227 + 0.0230 +••• + 0.0874 + 4.0 = 6.0 bits. (4.13) 

So once we know where the submarine is, the total Shannon information con- 
tent gained is 6 bits. 

This result holds regardless of when we hit the submarine. If we hit it 
when there are n squares left to choose from - n was 16 in equation (4.13) - 
then the total information gained is: 

64 , 63 , n + 1 , n 

lo S2 ™ + lo S2 777 + • • • + lo S2 + lo S2 t 

63 62 n ' 

= log; 



64 63 n + 1 n 

X X • • • X X — 

63 62 n 1 



64 

log 2 y = 6 bits. (4.14) 



What have we learned from the examples so far? I think the submarine 
example makes quite a convincing case for the claim that the Shannon infor- 
mation content is a sensible measure of information content. And the game of 
sixty-three shows that the Shannon information content can be intimately 
connected to the size of a file that encodes the outcomes of a random experi- 
ment, thus suggesting a possible connection to data compression. 

In case you're not convinced, let's look at one more example. 



The Wenglish language 

Wenglish is a language similar to English. Wenglish sentences consist of words 
drawn at random from the Wenglish dictionary, which contains 2 15 = 32,768 
words, all of length 5 characters. Each word in the Wenglish dictionary was 
constructed at random by picking five letters from the probability distribution 
over a. . .z depicted in figure 2.1. 

Some entries from the dictionary are shown in alphabetical order in fig- 
ure 4.4. Notice that the number of words in the dictionary (32,768) is 
much smaller than the total number of possible words of length 5 letters, 
26 5 ~ 12,000,000. 

Because the probability of the letter z is about 1/1000, only 32 of the 
words in the dictionary begin with the letter z. In contrast, the probability 
of the letter a is about 0.0625, and 2048 of the words begin with the letter a. 
Of those 2048 words, two start az, and 128 start aa. 

Let's imagine that we are reading a Wenglish document, and let's discuss 
the Shannon information content of the characters as we acquire them. If we 
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1 


aaail 


2 


aaaiu 


3 


aaald 


129 


abati 


2047 


azpan 


2048 


aztdn 


16 384 


odrcr 


32 737 


zatnt 


32 768 


zxast 



Figure 4.4. The Wenglish 
dictionary. 
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are given the text one word at a time, the Shannon information content of 
each five-character word is log 32,768 = 15 bits, since Wenglish uses all its 
words with equal probability. The average information content per character 
is therefore 3 bits. 

Now let's look at the information content if we read the document one 
character at a time. If, say, the first letter of a word is a, the Shannon 
information content is log 1 /0.0625 ~ 4 bits. If the first letter is z, the Shannon 
information content is log 1/0.001 ~ 10 bits. The information content is thus 
highly variable at the first character. The total information content of the 5 
characters in a word, however, is exactly 15 bits; so the letters that follow an 
initial z have lower average information content per character than the letters 
that follow an initial a. A rare initial letter such as z indeed conveys more 
information about what the word is than a common initial letter. 

Similarly, in English, if rare characters occur at the start of the word (e.g. 
xyl. . .), then often we can identify the whole word immediately; whereas 
words that start with common characters (e.g. pro . . . ) require more charac- 
ters before we can identify them. 



► 4.2 Data compression 

The preceding examples justify the idea that the Shannon information content 
of an outcome is a natural measure of its information content. Improbable out- 
comes do convey more information than probable outcomes. We now discuss 
the information content of a source by considering how many bits are needed 
to describe the outcome of an experiment. 

If we can show that we can compress data from a particular source into 
a file of L bits per source symbol and recover the data reliably, then we will 
say that the average information content of that source is at most L bits per 
symbol. 



Example: compression of text files 



A file is composed of a sequence of bytes. A byte is composed of 8 bits and 
can have a decimal value between and 255. A typical text file is composed 
of the ASCII character set (decimal values to 127). This character set uses 
only seven of the eight bits in a byte. 

> Exercise 4.4.^' P- 86 l By how much could the size of a file be reduced given 
that it is an ASCII file? How would you achieve this reduction? 

Intuitively, it seems reasonable to assert that an ASCII file contains 7/8 as 
much information as an arbitrary file of the same size, since we already know 
one out of every eight bits before we even look at the file. This is a simple ex- 
ample of redundancy. Most sources of data have further redundancy: English 
text files use the ASCII characters with non-equal frequency; certain pairs of 
letters are more probable than others; and entire words can be predicted given 
the context and a semantic understanding of the text. 



Here we use the word 'bit' with its 
meaning, 'a symbol with two 
values', not to be confused with 
the unit of information content. 



Some simple data compression methods that define measures of informa- 
tion content 

One way of measuring the information content of a random variable is simply 
to count the number of possible outcomes, |^4x|- (The number of elements in 
a set A is denoted by |„4|.) If we gave a binary name to each outcome, the 
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length of each name would be log 2 \ Ax\ bits, if \Ax\ happened to be a power 
of 2. We thus make the following definition. 

The raw bit content of X is 

floW=Iog 2 |^|. (4-15) 

Hq(X) is a lower bound for the number of binary questions that are always 
guaranteed to identify an outcome from the ensemble X. It is an additive 
quantity: the raw bit content of an ordered pair x, y, having |-A;r||.4y| possible 
outcomes, satisfies 

H (X,Y) = H (X) + H (Y). (4.16) 

This measure of information content does not include any probabilistic 
element, and the encoding rule it corresponds to does not 'compress' the source 
data, it simply maps each outcome to a constant-length binary string. 

Exercise 4.5. [2 ' p ' 86] Could there b e a compressor that maps an outcome x to 
a binary code c(x), and a decompressor that maps c back to x, such 
that every possible outcome is compressed into a binary code of length 
shorter than H (X) bits? 

Even though a simple counting argument shows that it is impossible to make 
a reversible compression program that reduces the size of all files, ama- 
teur compression enthusiasts frequently announce that they have invented 
a program that can do this - indeed that they can further compress com- 
pressed files by putting them through their compressor several times. Stranger 
yet, patents have been granted to these modern-day alchemists. See the 
comp. compression frequently asked questions for further reading. 1 

There are only two ways in which a 'compressor' can actually compress 
files: 

1. A lossy compressor compresses some files, but maps some files to the 
same encoding. We'll assume that the user requires perfect recovery of 
the source file, so the occurrence of one of these confusable files leads 
to a failure (though in applications such as image compression, lossy 
compression is viewed as satisfactory). We'll denote by S the probability 
that the source string is one of the confusable files, so a lossy compressor 
has a probability 6 of failure. If S can be made very small then a lossy 
compressor may be practically useful. 

2. A lossless compressor maps all files to different encodings; if it shortens 
some files, it necessarily makes others longer. We try to design the 
compressor so that the probability that a file is lengthened is very small, 
and the probability that it is shortened is large. 

In this chapter we discuss a simple lossy compressor. In subsequent chapters 
we discuss lossless compression methods. 

► 4.3 Information content defined in terms of lossy compression 

Whichever type of compressor we construct, we need somehow to take into 
account the probabilities of the different outcomes. Imagine comparing the 
information contents of two text files - one in which all 128 ASCII characters 




x http : // sunsite . org.uk/public/usenet/news-faqs/comp . compression/ 
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are used with equal probability, and one in which the characters are used with 
their frequencies in English text. Can we define a measure of information 
content that distinguishes between these two files? Intuitively, the latter file 
contains less information per character because it is more predictable. 

One simple way to use our knowledge that some symbols have a smaller 
probability is to imagine recoding the observations into a smaller alphabet 
- thus losing the ability to encode some of the more improbable symbols - 
and then measuring the raw bit content of the new alphabet. For example, 
we might take a risk when compressing English text, guessing that the most 
infrequent characters won't occur, and make a reduced ASCII code that omits 
the characters { ! , @, #, '/,, *, ~, <, >, /, \, _, {, }, [, ] , I }, thereby reducing 
the size of the alphabet by seventeen. The larger the risk we are willing to 
take, the smaller our final alphabet becomes. 

We introduce a parameter 6 that describes the risk we are taking when 
using this compression method: 5 is the probability that there will be no 
name for an outcome x. 



* }. 

Jl 1 1 1 i i i il 

t 4' 4' 4' 16' 64' 64' 64' 64 J • 



{a, b, 
l 



f, 
l 



(4.17) 



Example 4.6. Let 

A x 

and Vx : 

The raw bit content of this ensemble is 3 bits, corresponding to 8 binary 
names. But notice that P(x € {a, b, c,d}) = 15/16. So if we are willing 
to run a risk of 6 = 1/16 of not having a name for x, then we can get 
by with four names - half as many names as are needed if every x € Ax 
has a name. 

Table 4.5 shows binary names that could be given to the different out- 
comes in the cases 6 = and S = 1/16. When 5 = we need 3 bits to 
encode the outcome; when 6 = 1/16 we need only 2 bits. 

Let us now formalize this idea. To make a compression strategy with risk 
6, we make the smallest possible subset Sg such that the probability that x is 
not in Sg is less than or equal to 6, i.e., P(x Sg) < 6. For each value of 6 
we can then define a new measure of information content - the log of the size 
of this smallest subset Sg. [In ensembles in which several elements have the 
same probability, there may be several smallest subsets that contain different 
elements, but all that matters is their sizes (which are equal), so we will not 
dwell on this ambiguity] 

The smallest (5-sufflcient subset Sg is the smallest subset of Ax satisfying 

P(x eSg)>l-6. (4.18) 

The subset Sg can be constructed by ranking the elements of Ax in order of 
decreasing probability and adding successive elements starting from the most 
probable elements until the total probability is > (1 — 6). 

We can make a data compression code by assigning a binary name to each 
element of the smallest sufficient subset. This compression scheme motivates 
the following measure of information content: 

The essential bit content of X is: 

Hg{X) = \og 2 \Sg\. (4.19) 

Note that H (X) is the special case of Hg(X) with S = (if P(x) > for all 
x € Ax)- [Caution: do not confuse Hq(X) and Hg(X) with the function i?2(p) 
displayed in figure 4.1.] 

Figure 4.6 shows Hg(X) for the ensemble of example 4.6 as a function of 

6. 



6 = 


6 = 


1/16 


x c(x) 


X 


c(x) 


a 000 


a 


00 


b 001 


b 


01 


c 010 


c 


10 


d 011 


d 


11 


e 100 


e 




f 101 


f 




g 110 


g 




h 111 


h 





Table 4.5. Binary names for the 
outcomes, for two failure 
probabilities 5. 
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-6 



e,f,g,h 



-4 



-2.4 



log 2 P(x) 




a.b.c 
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Figure 4.6. (a) The outcomes of X 
(from example 4.6 (p. 75)), ranked 
by their probability, (b) The 
essential bit content H$(X). The 
labels on the graph show the 
smallest sufficient set as a 
function of 5. Note H (X) = 3 
bits and H 1 / 16 (X) = 2 bits. 



(a) 



H 5 (X) 



(b) 



{a,b,c,d,e,f,g,h} 

- {a,b,c,d,e,f,g} 

— {a,b,c,d,e,f} 
{a,b,c,d,e} 



{a.b.cd} 



{a.b.c} 



- (a.b) 



.fa) 



0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 



Extended ensembles 

Is this compression method any more useful if we compress blocks of symbols 
from a source? 

We now turn to examples where the outcome x = (xi,X2, ■ ■ ■ ,xn) is a 
string of N independent identically distributed random variables from a single 
ensemble X. We will denote by X N the ensemble (Xi,X 2 , ... , X^). Remem- 
ber that entropy is additive for independent variables (exercise 4.2 (p. 68)), so 
H{X N ) = NH{X). 

Example 4.7. Consider a string of N flips of a bent coin, x = [x\, x 2 , • • • , xn), 
where x n € {0, l}, with probabilities po = 0.9, pi = 0.1. The most prob- 
able strings x are those with most 0s. If r(x) is the number of Is in x 
then 

T>t \ N—r(x) r(x) „„% 

n x ) =Pq Pi ■ (4-20) 

To evaluate Hg(X N ) we must find the smallest sufficient subset Sg. This 
subset will contain all x with r(x) = 0, 1, 2, . . . , up to some r max (S) — 1, 
and some of the x with r(x) = r max (5). Figures 4.7 and 4.8 show graphs 
of Hg{X ) against 5 for the cases 7Y = 4 and N = 10. The steps are the 
values of 5 at which \Sg\ changes by 1, and the cusps where the slope of 
the staircase chang es are the points where r max changes by 1. 

Exercise 4.8. [2 > p ' 86] What are the mathematical shapes of the curves between 
the cusps? 

For the examples shown in figures 4.6-4.8, Hg(X N ) depends strongly on 
the value of S, so it might not seem a fundamental or useful definition of 
information content. But we will consider what happens as N, the number 
of independent variables in X N , increases. We will find the remarkable result 
that Hg(X N ) becomes almost independent of S - and for all S it is very close 
to NH(X), where H(X) is the entropy of one of the random variables. 

Figure 4.9 illustrates this asymptotic tendency for the binary ensemble of 
example 4.7. As N increases, jjHs(X N ) becomes an increasingly flat function, 
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Figure 4.7. (a) The sixteen 
outcomes of the ensemble X 4 with 
Pi =0.1, ranked by probability, 
(b) The essential bit content 
Hg(X 4 ). The upper schematic 
diagram indicates the strings' 
probabilities by the vertical lines' 
lengths (not to scale). 
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binary variables with p\ = 0.1. 



hH s (X») M 



N 





N=10 




N-210 




N-410 




N-610 




N=810 




N=1010 















Figure 4.9. ±H S (X N ) for 
N = 10, 210,..., 1010 binary 
variables with pi = 0.1. 
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except for tails close to S = and 1. As long as we are allowed a tiny 
probability of error 5, compression down to NH bits is possible. Even if we 
are allowed a large probability of error, we still can compress only down to 
NH bits. This is the source coding theorem. 

Theorem 4.1 Shannon's source coding theorem. Let X be an ensemble with 
entropy H(X) = H bits. Given e > and < S < 1, there exists a positive 
integer Nq such that for N > Nq, 



^H 5 (X N ) - H 



< e. (4.21) 



► 4.4 Typicality 



Why does increasing N help? Let's examine long strings from X N . Table 4.10 
shows fifteen samples from X N for N = 100 and pi = 0.1. The probability 
of a string x that contains r Is and N—r 0s is 

P(x)=p r 1 (l- Pl ) N - r . (4.22) 

The number of strings that contain r Is is 

n(r)=( N ). (4.23) 



, r , 

So the number of Is, r, has a binomial distribution: 

p W=(f)rf(l-Pi) JV_r - (4-24) 

These functions are shown in figure 4.11. The mean of r is Npi, and its 
standard deviation is y/Npi(l — p\) (p.l). If N is 100 then 



r ~ N Pl ± y/Np^l - pi) ~ 10 ± 3. 



(4.25) 
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4.4: Typicality 




Figure 4.11. Anatomy of the typical set T. For pi = 0.1 and N = 100 and N = 1000, these graphs 
show n(r), the number of strings containing r Is; the probability P(x) of a single string 
that contains r Is; the same probability on a log scale; and the total probability n(r)P(x) of 
all strings that contain r Is. The number r is on the horizontal axis. The plot of log 2 -P(x) 
also shows by a dotted line the mean value of log 2 P(x) = —NH 2 (pi) which equals —46.9 
when N = 100 and —469 when N = 1000. The typical set includes only the strings that 
have log 2 .P(x) close to this value. The range marked T shows the set T^p (as defined in 
section 4.4) for N = 100 and (3 = 0.29 (left) and iV = 1000, f3 = 0.09 (right). 
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If N = 1000 then 

r~100±10. (4.26) 

Notice that as N gets bigger, the probability distribution of r becomes more 
concentrated, in the sense that while the range of possible values of r grows 
as 7V, the standard deviation of r grows only as y/~N. That r is most likely to 
fall in a small range of values implies that the outcome x is also most likely to 
fall in a corresponding small subset of outcomes that we will call the typical 
set. 

Definition of the typical set 

Let us define typicality for an arbitrary ensemble X with alphabet Ax- Our 
definition of a typical string will involve the string's probability. A long string 
of N symbols will usually contain about p\N occurrences of the first symbol, 
P2N occurrences of the second, etc. Hence the probability of this string is 
roughly 

P(x) typ = P( Xl )P(x 2 )P(x 3 ) . . . P(x N ) ~ p^ N) p^ N) . . .ppW (4.27) 
so that the information content of a typical string is 

log 2 -i- ~ N P l log 2 - - NH. (4.28) 

* W i Pi 

So the random variable log 2 VP(x), which is the information content of x, is 
very likely to be close in value to NH. We build our definition of typicality 
on this observation. 

We define the typical elements of Ax to be those elements that have prob- 
ability close to 2~ NH . (Note that the typical set, unlike the smallest sufficient 
subset, does not include the most probable elements of Ax, but we will show 
that these most probable elements contribute negligible probability.) 

We introduce a parameter (3 that defines how close the probability has to 
be to 2~ NH for an element to be 'typical'. We call the set of typical elements 
the typical set, T N p: 



N p = <. x G Ax 



— logo — ^— — H 
N m P(x) 



< (5 . (4.29) 



We will show that whatever value of j3 we choose, the typical set contains 
almost all the probability as N increases. 

This important result is sometimes called the 'asymptotic equipartition' 
principle. 

'Asymptotic equipartition' principle. For an ensemble of TV independent 
identically distributed (i.i.d.) random variables X N = (X\, X2, ■ ■ • , Xn), 
with N sufficiently large, the outcome x = (x\, X2, ■ ■ ■ , xn) is almost 
certain to belong to a subset of Ax having only 2 NH ( X ' members, each 
having probability 'close to' 2~ NH ( X \ 

Notice that if H(X) < H (X) then 2 NH ( X ~) is a tiny fraction of the number 
of possible outcomes \A X \ = \A X \ N = 2 NHa{ - x \ 

The term equipartition is chosen to describe the idea that the members of 
the typical set have roughly equal probability. [This should not be taken too 
literally, hence my use of quotes around 'asymptotic equipartition'; sec page 
83.] 

A second meaning for equipartition, in thermal physics, is the idea that each 
degree of freedom of a classical system has equal average energy, \kT . This 
second meaning is not intended here. 
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0000000000000. . . 00000000000 

Figure 4.12. Schematic diagram 

The 'asymptotic equipartition' principle is equivalent to: showing all strings in the ensemble 

X ranked by their probability, 
Shannon's source coding theorem (verbal statement). N i.i.d. ran- and the typical set T/v/3- 
dom variables each with entropy H(X) can be compressed into more 
than NH(X) bits with negligible risk of information loss, as N — > oo; 
conversely if they are compressed into fewer than NH(X) bits it is vir- 
tually certain that information will be lost. 

These two theorems are equivalent because we can define a compression algo- 
rithm that gives a distinct name of length NH(X) bits to each x in the typical 
set. 

► 4.5 Proofs 

This section may be skipped if found tough going. 

The law of large numbers 

Our proof of the source coding theorem uses the law of large numbers. 

Mean and variance of a real random variable are £[u] = u = P(u)u 
and var(u) = a 2 u = £[(u - u) 2 ] = J2 U P(u)(u — u) 2 . 

Technical note: strictly I am assuming here that u is a function u(x) 
of a sample x from a finite discrete ensemble X. Then the summations 
^2 U P(u)f(u) should be written ^ P(x)f(u(x)). This means that P(u) 
is a finite sum of delta functions. This restriction guarantees that the 
mean and variance of u do exist, which is not necessarily the case for 
general P(u). 

Chebyshev's inequality 1. Let t be a non-negative real random variable, 
and let a be a positive real number. Then 

Pit >«)<-. (4.30) 
a 

Proof: P(t > a) = ^2 t>a P(t). We multiply each term by t/a > 1 and 
obtain: P(t > a) < ^2 t > a P(t)t/a. We add the (non-negative) missing 
terms and obtain: P(t > a) < J2t P(t)t/a = t/a. □ 
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Chebyshev's inequality 2. Let x be a random variable, and let a be a 
positive real number. Then 

P ((x - xf >a)< alja. (4.31) 
Proof: Take t = (x — x) 2 and apply the previous proposition. □ 

Weak law of large numbers. Take x to be the average of N independent 
random variables hi, . . . , /ijv, having common mean h and common vari- 
ance a\: x = X)n=i h n- Then 

P((x - hf >a)< <J 2 h /aN. (4.32) 

Proof: obtained by showing that x = h and that a 2 , = a\jN . □ 

We are interested in x being very close to the mean (a very small). No matter 
how large a\ is, and no matter how small the required a is, and no matter 
how small the desired probability that (a; — h) 2 > a, we can always achieve it 
by taking N large enough. 



Proof of theorem 4-1 (p-78) 

We apply the law of large numbers to the random variable log 2 -p^y defined 
for x drawn from the ensemble X N . This random variable can be written as 
the average of N information contents h n = log 2 (l/ P(x n )), each of which is a 
random variable with mean H = H(X) and variance a 2 = var[log 2 (l/P(x n ))]. 
(Each term h n is the Shannon information content of the nth outcome.) 
We again define the typical set with parameters N and (3 thus: 

1 2 "1 



'-N0 



1 



P(x) 



H 



< (3 2 



For all x G T^p, the probability of x satisfies 

2 -N(H+P) <p(x) <2 -N(H-p)^ 

And by the law of large numbers, 



P{* e T Nf3 ) > l 



a 



(4.33) 



(4.34) 



(4.35) 



We have thus proved the 'asymptotic equipartition' principle. As increases, 
the probability that x falls in Tj^p approaches 1, for any (3. How does this 
result relate to source coding? 

We must relate Tiqp to H$(X N ). We will show that for any given 5 there 
is a sufficiently big N such that H S (X N ) ~ NH. 

Part 1: jjH s {X N ) < H + e. 

The set is not the best subset for compression. So the size of T N p gives 
an upper bound on H5. We show how small Hg(X N ) must be by calculating 
how big T^p could possibly be. We are free to set (3 to any convenient value. 
The smallest possible probability that a member of T^p can have is 2~ N ( H+I3 \ 
and the total probability that T^p contains can't be any bigger than 1. So 

\T N p\2- N ^<l, 



that is, the size of the typical set is bounded by 

\T Nf3 \ <2 N ( H +P\ 



(4.36) 



(4.37) 



If we set (3 = e and Nq such that < 5, then P(Tj^p) > 1 — 8, and the set 
Tnp becomes a witness to the fact that H${X N ) < log 2 \T^p\ < N(H + e). 



H (X) 



rH + e 
t 

\ 

- H - e 



1 8 

Figure 4.13. Schematic illustration 
of the two parts of the theorem. 
Given any <5 and e, we show that 
for large enough N, ±H S (X N ) 
lies (1) below the line H + e and 
(2) above the line H — e. 
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Part 2: jjH s (X N ) > H - e. 

Imagine that someone claims this second part is not so - that, for any 7V, 
the smallest <5-sufficient subset Sg is smaller than the above inequality would 
allow. We can make use of our typical set to show that they must be mistaken. 
Remember that we are free to set (3 to any value we choose. We will set 
(3 = e/2, so that our task is to prove that a subset S' having |S"| < 2 N ( H ~ 2 ^ 
and achieving P(x € S r ) > 1 — 5 cannot exist (for TV greater than an TVq that 
we will specify). 

So, let us consider the probability of falling in this rival smaller subset S' . ^np S /\ \.S' 
The probability of the subset S' is V V w V 

p(x e S') = P(x e S'nT NI3 ) + P(x e S'nT N/3 ), (4.38) s , \ T ^ 



where Tjqp denotes the complement {x ^ P/v/3}. The maximum value of 
the first term is found if S' fl T N p contains 2 N ( H ~ 213 } outcomes all with the 
maximum probability, 2~ N ( H ~@\ The maximum value the second term can 
have is P(x T N p). So: 

P(x e S') < 2 N ^- 2 ^ 2- N ^ + ^ = 2"^ + (4.39) 

We can now set (3 = e/2 and TV such that P(x e S') < 1 — 5, which shows 
that S' cannot satisfy the definition of a sufficient subset Sg. Thus any subset 
S' with size \S'\ < 2 Ar (- ff ~ e ) has probability less than 1 — 5, so by the definition 
of Hg, H S (X N ) > N(H - e). 

Thus for large enough TV, the function jjHg(X N ) is essentially a constant 
function of 5, for < S < 1, as illustrated in figures 4.9 and 4.13. □ 



► 4.6 Comments 

The source coding theorem (p. 78) has two parts, jjHg(X N ) < H + e, and 
j r H s {X N ) > H-e. Both results are interesting. 

The first part tells us that even if the probability of error 5 is extremely 
small, the number of bits per symbol jfHg(X N ) needed to specify a long 
TV-symbol string x with vanishingly small error probability does not have to 
exceed H + e bits. We need to have only a tiny tolerance for error, and the 
number of bits required drops significantly from Hq(X) to (H + e). 

What happens if we are yet more tolerant to compression errors? Part 2 
tells us that even if 6 is very close to 1, so that errors are made most of the 
time, the average number of bits per symbol needed to specify x must still be 
at least H — e bits. These two extremes tell us that regardless of our specific 
allowance for error, the number of bits per symbol needed to specify x is H 
bits; no more and no less. 



Caveat regarding 'asymptotic equipartition' 

I put the words 'asymptotic equipartition' in quotes because it is important 
not to think that the elements of the typical set T^p really do have roughly 
the same probability as each other. They are similar in probability only in 
the sense that their values of log 2 -p^y are within 2TV/3 of each other. Now, as 
P is decreased, how does TV have to increase, if we are to keep our bound on 

2 

the mass of the typical set, P(x € Pjv/3) > 1 — constant? TV must grow 
as 1//3 2 , so, if we write (3 in terms of TV as a/y/N, for some constant a, then 
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the most probable string in the typical set will be of order 2 Q ^ times greater 
than the least probable string in the typical set. As f3 decreases, TV increases, 
and this ratio grows exponentially. Thus we have 'equipartition' only in 

a weak sense! 



Why did we introduce the typical set? 

The best choice of subset for block compression is (by definition) Sg, not a 
typical set. So why did we bother introducing the typical set? The answer is, 
we can count the typical set. We know that all its elements have 'almost iden- 
tical' probability (2~ ), and we know the whole set has probability almost 
1, so the typical set must have roughly 2 NH elements. Without the help of 
the typical set (which is very similar to Sg) it would have been hard to count 
how many elements there are in Sg. 



► 4.7 Exercises 

Weighing problems 

> Exercise 4.9. While some people, when they first encounter the weighing 

problem with 12 balls and the three-outcome balance (exercise 4.1 
(p. 66)), think that weighing six balls against six balls is a good first 
weighing, others say 'no, weighing six against six conveys no informa- 
tion at all'. Explain to the second group why they are both right and 
wrong. Compute the information gained about which is the odd ball, 
and the information gained about which is the odd ball and whether it is 
heavy or light. 

> Exercise 4.10.^ Solve the weighing problem for the case where there are 39 

balls of which one is known to be odd. 

> Exercise 4.11.^ You are given 16 balls, all of which are equal in weight except 

for one that is either heavier or lighter. You are also given a bizarre two- 
pan balance that can report only two outcomes: 'the two sides balance' 
or 'the two sides do not balance'. Design a strategy to determine which 
is the odd ball in as few uses of the balance as possible. 

> Exercise 4.12 J 2 ' You have a two-pan balance; your job is to weigh out bags of 

flour with integer weights 1 to 40 pounds inclusive. How many weights 
do you need? [You are allowed to put weights on either pan. You're only 
allowed to put one flour bag on the balance at a time.] 

Exercise 4.13.^' p ' 86 -' (a) Is it possible to solve exercise 4.1 (p. 66) (the weigh- 
ing problem with 12 balls and the three-outcome balance) using a 
sequence of three fixed weighings, such that the balls chosen for the 
second weighing do not depend on the outcome of the first, and the 
third weighing does not depend on the first or second? 

(b) Find a solution to the general TV-ball weighing problem in which 
exactly one of TV balls is odd. Show that in W weighings, an odd 
ball can be identified from among TV = (3 W — 3)/2 balls. 

bxercise 4.14. [3] You are given 12 balls and the three-outcome balance of exer- 
cise 4.1; this time, two of the balls are odd; each odd ball may be heavy 
or light, and we don't know which. We want to identify the odd balls 
and in which direction they are odd. 
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4.7: Exercises 

(a) Estimate how many weighings are required by the optimal strategy. 
And what if there are three odd balls? 

(b) How do your answers change if it is known that all the regular balls 
weigh 100 g, that light balls weigh 99 g, and heavy ones weigh 110 g? 

Source coding with a lossy compressor, with loss 5 

t> Exercise 4.15. [2 ' p ' 87] Let V x = {0.2,0.8}. Sketch ±H S {X N ) as a function of 
6 for N = 1,2 and 1000. 

c> Exercise 4.16. [2] Let Vy = {0.5,0.5}. Sketch ±H S (Y N ) as a function of S for 
N = 1,2,3 and 100. 

> Exercise 4.17. ^' p ' 8 ^ (For physics students.) Discuss the relationship between 

the proof of the 'asymptotic equipartition' principle and the equivalence 
(for large systems) of the Boltzmann entropy and the Gibbs entropy. 

Distributions that don't obey the law of large numbers 

The law of large numbers, which we used in this chapter, shows that the mean 
of a set of N i.i.d. random variables has a probability distribution that becomes 
narrower, with width oc l/y/~N, as N increases. However, we have proved 
this property only for discrete random variables, that is, for real numbers 
taking on a finite set of possible values. While many random variables with 
continuous probability distributions also satisfy the law of large numbers, there 
are important distributions that do not. Some continuous distributions do not 
have a mean or variance. 

> Exercise 4. 18.^' p ' 8 ^ Sketch the Cauchy distribution 

What is its normalizing constant Z1 Can you evaluate its mean or 
variance? 

Consider the sum z = x\ + X2, where x\ and X2 are independent random 
variables from a Cauchy distribution. What is P(z)t What is the prob- 
ability distribution of the mean of x\ and X2, x = (x\ + X2)/2? What is 
the probability distribution of the mean of TV samples from this Cauchy 
distribution? 

Other asymptotic properties 

Exercise 4.19.'- 5 -' Chernoff bound. We derived the weak law of large numbers 
from Chebyshev's inequality (4.30) by letting the random variable t in 
the inequality P(t > a) < t/a be a function, t = (x — x) 2 , of the random 
variable x we were interested in. 

Other useful inequalities can be obtained by using other functions. The 
Chernoff bound, which is useful for bounding the tails of a distribution, 
is obtained by letting t = exp(sx). 

Show that 

P(x >a)< e-^gis), for any s > (4.41) 

and 

P(x < a) < e~ sa g{s), for any s < (4.42) 
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where g(s) is the moment-generating function of x, 

g(s) = J2P(x)e SX - (4-43) 

X 

Curious functions related to plogl/p 

Exercise 4.20.^' p ' 89 ^ This exercise has no purpose at all; it's included for the 
enjoyment of those who like mathematical curiosities. 

Sketch the function 

f(x) = (4.44) 

for x > 0. Hint: Work out the inverse function to / - that is, the function 
g(y) such that if x = g(y) then y = f(x) - it's closely related to plogl/p. 

► 4.8 Solutions 

Solution to exercise 4.2 (p. 68). Let P{x,y) = P(x)P(y). Then 

H(X,Y) = J2 P (x)P(v)log p{ J )p{y) (4.45) 
= £p(z)Pfo)log^ + £p(z)P(y)tog^ (4.46) 

xy ^ 1 xy ^ 1 

= E^) lo gp^y + E p (y) lo gj4) (447) 

= H(X) + H(Y). (4.48) 

Solution to exercise 4.4 (p. 73). An ASCII file can be reduced in size by a 
factor of 7/8. This reduction could be achieved by a block code that maps 
8-byte blocks into 7-byte blocks by copying the 56 information-carrying bits 
into 7 bytes, and ignoring the last bit of every character. 

Solution to exercise 4.5 (p. 74). The pigeon-hole principle states: you can't 
put 16 pigeons into 15 holes without using one of the holes twice. 

Similarly, you can't give Ax outcomes unique binary names of some length 
I shorter than log 2 \Ax\ bits, because there are only 2 l such binary names, 
and / < log 2 \Ax\ implies 2 l < \Ax\, so at least two different inputs to the 
compressor would compress to the same output file. 

Solution to exercise 4.8 (p. 76). Between the cusps, all the changes in proba- 
bility are equal, and the number of elements in T changes by one at each step. 
So H$ varies logarithmically with (—5). 

Solution to exercise 4.13 (p. 84). This solution was found by Dyson and Lyness 
in 1946 and presented in the following elegant form by John Conway in 1999. 
Be warned: the symbols A, B, and C are used to name the balls, to name the 
pans of the balance, to name the outcomes, and to name the possible states 
of the odd ball! 

(a) Label the 12 balls by the sequences 

AAB ABA ABB ABC BBC BCA BCB BCC CAA CAB CAC CCA 



and in the 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 



4.8: Solutions 87 

1st AAB ABA ABB ABC BBC BCA BCB BCC 

2nd weighings put AAB CAA CAB CAC in pan A, ABA ABB ABC BBC in pan B. 

3rd ABA BCA CAA CCA AAB ABB BCB CAB 

Now in a given weighing, a pan will either end up in the 

• Canonical position (C) that it assumes when the pans are balanced, 
or 

• Above that position (A), or 

• Below it (B), 

so the three weighings determine for each pan a sequence of three of 
these letters. 

If both sequences are CCC, then there's no odd ball. Otherwise, for just 
one of the two pans, the sequence is among the 12 above, and names 
the odd ball, whose weight is Above or Below the proper one according 
as the pan is A or B. 

(b) In W weighings the odd ball can be identified from among 

AT=(3 w -3)/2 (4.49) 

balls in the same way, by labelling them with all the non-constant se- 
quences of W letters from A, B, C whose first change is A-to-B or B-to-C 
or C-to-A, and at the wth weighing putting those whose wth letter is A 
in pan A and those whose wth letter is B in pan B. 

Solution to exercise 4.15 (p. 85). The curves jjH$(X N ) as a function of 5 for 
N = 1,2 and 1000 are shown in figure 4.14. Note that H 2 {0.2) = 0.72 bits. 
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Solution to exercise 4.17 (p. 85). The Gibbs entropy is fcBX^Pi m ^"' where i 
runs over all states of the system. This entropy is equivalent (apart from the 
factor of &b) to the Shannon entropy of the ensemble. 

Whereas the Gibbs entropy can be defined for any ensemble, the Boltz- 
mann entropy is only defined for microcanonical ensembles, which have a 
probability distribution that is uniform over a set of accessible states. The 
Boltzmann entropy is defined to be Sq = fee InCl where £7 is the number of ac- 
cessible states of the microcanonical ensemble. This is equivalent (apart from 
the factor of fee) to the perfect information content Ho of that constrained 
ensemble. The Gibbs entropy of a microcanonical ensemble is trivially equal 
to the Boltzmann entropy. 



Figure 4.14. j^Hs(X) (vertical 
axis) against 5 (horizontal), for 
N = 1,2, 100 binary variables 
with pi = 0.4. 
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88 4 — The Source Coding Theorem 

We now consider a thermal distribution (the canonical ensemble), where 
the probability of a state x is 

P(x) = -Lxp f-fMV (4.50) 



Z r \ k B T , 

With this canonical ensemble we can associate a corresponding microcanonical 
ensemble, an ensemble with total energy fixed to the mean energy of the 
canonical ensemble (fixed to within some precision e). Now, fixing the total 
energy to a precision e is equivalent to fixing the value of In V-P(x) to within 
ek^T. Our definition of the typical set Tjqp was precisely that it consisted 
of all elements that have a value of log P(x) very close to the mean value of 
logP(x) under the canonical ensemble, —NH{X). Thus the microcanonical 
ensemble is equivalent to a uniform distribution over the typical set of the 
canonical ensemble. 

Our proof of the 'asymptotic equipartition' principle thus proves - for the 
case of a system whose energy is separable into a sum of independent terms 
- that the Boltzmann entropy of the microcanonical ensemble is very close 
(for large N) to the Gibbs entropy of the canonical ensemble, if the energy of 
the microcanonical ensemble is constrained to equal the mean energy of the 
canonical ensemble. 

Solution to exercise 4.18 (p. 85). The normalizing constant of the Cauchy dis- 
tribution 

is 



f 



dx -J— = [tan- 1 ^ °° = - - — = 7T. (4.51) 
x 2 + l 1 J -°° 2 2 v ; 



The mean and variance of this distribution are both undefined. (The distribu- 
tion is symmetrical about zero, but this does not imply that its mean is zero. 
The mean is the value of a divergent integral.) The sum z = X\ + x 2 , where 
x\ and X2 both have Cauchy distributions, has probability density given by 
the convolution 

which after a considerable labour using standard methods gives 

which we recognize as a Cauchy distribution with width parameter 2 (where 
the original distribution has width parameter 1). This implies that the mean 
of the two points, x = (xi + X2)/2 = z/2, has a Cauchy distribution with 
width parameter 1. Generalizing, the mean of N samples from a Cauchy 
distribution is Cauchy-distributed with the same parameters as the individual 
samples. The probability distribution of the mean does not become narrower 
as 1/VN. 

The central-limit theorem does not apply to the Cauchy distribution, be- 
cause it does not have a finite variance. 

An alternative neat method for getting to equation (4.53) makes use of the 
Fourier transform of the Cauchy distribution, which is a biexponential e - '"'. 
Convolution in real space corresponds to multiplication in Fourier space, so 
the Fourier transform of z is simply e~l 2w . Reversing the transform, we obtain 
equation (4.53). 
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4.8: Solutions 



Solution to exercise 4.20 (p. 86). The function f(x) has inverse function 

g(y) = y l 'y. (4.54) 



Note 



logff(y) = 1/f/logy. 



(4.55) 



I obtained a tentative graph of f(x) by plotting g(y) with y along the vertical 
axis and g(y) along the horizontal axis. The resulting graph suggests that 
f(x) is single valued for x £ (0,1), and looks surprisingly well-behaved and 
ordinary; for x £ (l,e l / e ), f(x) is two-valued. f(V^) is equal both to 2 and 
4. For x > e 1 / 6 (which is about 1.44), f(x) is infinite. However, it might be 
argued that this approach to sketching f(x) is only partly valid, if we define / 
as the limit of the sequence of functions X , X , X ; ■ ■ 

.; this sequence does not 

have a limit for < x < (l/e) e ~ 0.07 on account of a pitchfork bifurcation 
at x = (l/e) e ; and for x £ (1, e 1,/e ), the sequence's limit is single-valued - the 
lower of the two values sketched in the figure. 




Figure 4.15. f(x) = 

at three different scales. 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 



About Chapter 5 



In the last chapter, we saw a proof of the fundamental status of the entropy 
as a measure of average information content. We defined a data compression 
scheme using fixed length block codes, and proved that as N increases, it is 
possible to encode N i.i.d. variables x = {x\, . . . , xjq) into a block of N(H(X)+ 
e) bits with vanishing probability of error, whereas if we attempt to encode 
X N into N(H(X) — e) bits, the probability of error is virtually 1. 

We thus verified the possibility of data compression, but the block coding 
defined in the proof did not give a practical algorithm. In this chapter and 
the next, we study practical data compression algorithms. Whereas the last 
chapter's compression scheme used large blocks of fixed size and was lossy, 
in the next chapter we discuss variable-length compression schemes that are 
practical for small block sizes and that are not lossy. 

Imagine a rubber glove filled with water. If we compress two fingers of the 
glove, some other part of the glove has to expand, because the total volume 
of water is constant. (Water is essentially incompressible.) Similarly, when 
we shorten the codewords for some outcomes, there must be other codewords 
that get longer, if the scheme is not lossy. In this chapter we will discover the 
information-theoretic equivalent of water volume. 

Before reading Chapter 5, you should have worked on exercise 2.26 (p. 37). 

We will use the following notation for intervals: 

x £ [1, 2) means that x > 1 and x < 2; 
x £ (1, 2] means that x > 1 and x < 2. 
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5 



Symbol Codes 



In this chapter, we discuss variable-length symbol codes, which encode one 
source symbol at a time, instead of encoding huge strings of N source sym- 
bols. These codes are lossless: unlike the last chapter's block codes, they are 
guaranteed to compress and decompress without any errors; but there is a 
chance that the codes may sometimes produce encoded strings longer than 
the original source string. 

The idea is that we can achieve compression, on average, by assigning 
shorter encodings to the more probable outcomes and longer encodings to the 
less probable. 

The key issues are: 

What are the implications if a symbol code is lossless? If some code- 
words are shortened, by how much do other codewords have to be length- 



Making compression practical. How can we ensure that a symbol code is 
easy to decode? 

Optimal symbol codes. How should we assign codelengths to achieve the 
best compression, and what is the best achievable compression? 

We again verify the fundamental status of the Shannon information content 
and the entropy, proving: 

Source coding theorem (symbol codes). There exists a variable-length 
encoding C of an ensemble X such that the average length of an en- 
coded symbol, L(C,X), satisfies L(C,X) E [H(X),H(X) + 1). 

The average length is equal to the entropy H{X) only if the codelength 
for each outcome is equal to its Shannon information content. 

We will also define a constructive procedure, the Huffman coding algorithm, 
that produces optimal symbol codes. 

Notation for alphabets. A N denotes the set of ordered iV-tuples of ele- 
ments from the set A, i.e., all strings of length N. The symbol A + will 
denote the set of all strings of finite length composed of elements from 
the set A. 



ened? 



Example 5.1. {0, l} 3 



{000,001,010,011, 100, 101, 110, 111}. 



Example 5.2. {0, 1} 



+ 



{0,1,00,01,10,11 



000,001,...}. 
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92 5 — Symbol Codes 



► 5.1 Symbol codes 

A (binary) symbol code C for an ensemble X is a mapping from the range 
of x, Ax = • • • i o-i}, to {0, l} + . c(x) will denote the codeword cor- 
responding to x, and l(x) will denote its length, with l{ = l(a,{). 

The extended code C + is a mapping from A x to {0, l} + obtained by 
concatenation, without punctuation, of the corresponding codewords: 

c + (xix 2 . . . x N ) = c(xi)c(x 2 ) ■ ■ ■ c(x N ). (5.1) 



[The term 'mapping' here is a synonym for 'function'.] 

Example 5.3. A symbol code for the ensemble X defined by 

Ax = {a, b, c, d }, 
T x = {V2, 1/4, Vs, 1/8}, 

is Co, shown in the margin. 

Using the extended code, we may encode acdbac as 

c+(acdbac) = 100000100001010010000010. 



(5.2) 



(5.3) 



C : 



di 


C(a,i) 


h 


a 


1000 


4 


b 


0100 


4 


c 


0010 


4 


d 


0001 


4 



There are basic requirements for a useful symbol code. First, any encoded 
string must have a unique decoding. Second, the symbol code must be easy to 
decode. And third, the code should achieve as much compression as possible. 



Any encoded string must have a unique decoding 

A code C(X) is uniquely decodeable if, under the extended code C + , no 
two distinct strings have the same encoding, i.e., 

Vxje4, x^y c+(x)^c+(y). (5.4) 
The code Co defined above is an example of a uniquely decodeable code. 
The symbol code must be easy to decode 

A symbol code is easiest to decode if it is possible to identify the end of a 
codeword as soon as it arrives, which means that no codeword can be a prefix 
of another codeword. [A word c is a prefix of another word d if there exists a 
tail string t such that the concatenation ct is identical to d. For example, 1 is 
a prefix of 101, and so is 10.] 

We will show later that we don't lose any performance if we constrain our 
symbol code to be a prefix code. 

A symbol code is called a prefix code if no codeword is a prefix of any 
other codeword. 

A prefix code is also known as an instantaneous or self-punctuating code, 
because an encoded string can be decoded from left to right without 
looking ahead to subsequent codewords. The end of a codeword is im- 
mediately recognizable. A prefix code is uniquely decodeable. 

Prefix codes are also known as 'prefix-free codes' or 'prefix condition codes'. 



Prefix codes correspond to trees. 
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Example 5.4. The code C\ = {0, 101} is a prefix code because is not a prefix 
of 101, nor is 101 a prefix of 0. 

Example 5.5. Let C2 = {1, 101}. This code is not a prefix code because 1 is a 
prefix of 101. 

Example 5.6. The code C3 = {0, 10, 110, 111} is a prefix code. 
Example 5.7. The code C4 = {00, 01, 10, 11} is a prefix code. 
Exercise 5.8.^' p ' 104 l Is C2 uniquely decodeable? 

Example 5.9. Consider exercise 4.1 (p. 66) and figure 4.2 (p. 69). Any weighing 
strategy that identifies the odd ball and whether it is heavy or light can 
be viewed as assigning a ternary code to each of the 24 possible states. 
This code is a prefix code. 

The code should achieve as much compression as possible 

The expected length L(C,X) of a symbol code C for ensemble X is 



L(C,X)= P{x)l(x). 

x£A x 

We may also write this quantity as 

/ 

L{C,X)=Y J Pik 



(5.5) 



(5.6) 



i=l 



where / = \Ax\ 
Example 5.10. Let 



and 



Ax 
V x 



{ a, b, c, d }, 

{V2,V4,V8,V8}, 



(5.7) 



and consider the code C3. The entropy of X is 1.75 bits, and the expected 
length L(Cs,X) of this code is also 1.75 bits. The sequence of symbols 
x= (acdbac) is encoded as c + (x) = 0110111100110. C3 is a prefix code 
and is therefore uniquely decodeable. Notice that the codeword lengths 
satisfy k= log 2 (l/pj), or equivalently, pi = 2~ li . 

Example 5.11. Consider the fixed length code for the same ensemble X, C4. 
The expected length L(Ci,X) is 2 bits. 

Example 5.12. Consider C 5 . The expected length L(C§,X) is 1.25 bits, which 
is less than H(X). But the code is not uniquely decodeable. The se- 
quence x= (acdbac) encodes as 000111000, which can also be decoded 
as (cabdca). 

Example 5.13. Consider the code Cq. The expected length L(Cq,X) of this 
code is 1.75 bits. The sequence of symbols x= (acdbac) is encoded as 
c+(x) = 0011111010011. 

Is Cq a prefix code? It is not, because c(a) = is a prefix of both c(b) 
and c(c). 



Ci 







i' x 101 




Prefix codes can be represented 
on binary trees. Complete prefix 
codes correspond to binary trees 
with no unused branches. C\ is an 
incomplete code. 
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C 6 : 






a; 
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Is Cq uniquely decodeable? This is not so obvious. If you think that it 
might not be uniquely decodeable, try to prove it so by finding a pair of 
strings x and y that have the same encoding. [The definition of unique 
decodeability is given in equation (5.4).] 

Cq certainly isn't easy to decode. When we receive '00', it is possible 
that x could start 'aa', 'ab' or 'ac'. Once we have received '001111', 
the second symbol is still ambiguous, as x could be 'abd. . . ' or 'acd. . . '. 
But eventually a unique decoding crystallizes, once the next appears 
in the encoded stream. 

Cq is in fact uniquely decodeable. Comparing with the prefix code C3, 
we see that the codewords of Cq are the reverse of C3S. That C 3 is 
uniquely decodeable proves that Cq is too, since any string from Cq is 
identical to a string from C3 read backwards. 



► 5.2 What limit is imposed by unique decodeability? 

We now ask, given a list of positive integers {li}, does there exist a uniquely 
decodeable code with those integers as its codeword lengths? At this stage, we 
ignore the probabilities of the different symbols; once we understand unique 
decodeability better, we'll reintroduce the probabilities and discuss how to 
make an optimal uniquely decodeable symbol code. 

In the examples above, we have observed that if we take a code such as 
{00,01, 10, 11}, and shorten one of its codewords, for example 00 — > 0, then 
we can retain unique decodeability only if we lengthen other codewords. Thus 
there seems to be a constrained budget that we can spend on codewords, with 
shorter codewords being more expensive. 

Let us explore the nature of this budget. If we build a code purely from 
codewords of length I equal to three, how many codewords can we have and 
retain unique decodeability? The answer is 2 l = 8. Once we have chosen all 
eight of these codewords, is there any way we could add to the code another 
codeword of some other length and retain unique decodeability? It would 
seem not. 

What if we make a code that includes a length-one codeword, '0', with the 
other codewords being of length three? How many length-three codewords can 
we have? If we restrict attention to prefix codes, then we can have only four 
codewords of length three, namely {100,101,110,111}. What about other 
codes? Is there any other way of choosing codewords of length 3 that can give 
more codewords? Intuitively, we think this unlikely. A codeword of length 3 
appears to have a cost that is 2 2 times smaller than a codeword of length 1. 

Let's define a total budget of size 1, which we can spend on codewords. If 
we set the cost of a codeword whose length is I to 2~', then we have a pricing 
system that fits the examples discussed above. Codewords of length 3 cost 
V% each; codewords of length 1 cost 1/2 each. We can spend our budget on 
any codewords. If we go over our budget then the code will certainly not be 
uniquely decodeable. If, on the other hand, 



then the code may be uniquely decodeable. This inequality is the Kraft in- 
equality. 




(5.8) 



Kraft inequality. For any uniquely decodeable code C(X) over the binary 
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5.2: What limit is imposed by unique decodeability? 

alphabet {0, 1}, the codeword lengths must satisfy: 

I 

J2^~ k <1, (5-9) 

i=i 

where I = \Ax\- 

Completeness. If a uniquely decodeable code satisfies the Kraft inequality 
with equality then it is called a complete code. 

We want codes that are uniquely decodeable; prefix codes are uniquely de- 
codeable, and are easy to decode. So life would be simpler for us if we could 
restrict attention to prefix codes. Fortunately, for any source there is an op- 
timal symbol code that is also a prefix code. 

Kraft inequality and prefix codes. Given a set of codeword lengths that 
satisfy the Kraft inequality, there exists a uniquely decodeable prefix 
code with these codeword lengths. 

The Kraft inequality might be more accurately referred to as the Kraft- 
McMillan inequality: Kraft proved that if the inequality is satisfied, then a 
prefix code exists with the given lengths. McMillan (1956) proved the con- 
verse, that unique decodeability implies that the inequality holds. 

Proof of the Kraft inequality. Define S = J2i ■ Consider the quantity 



S N 



E 2 ^ 



EE--E r( ' li + i,2 + '"' SA ' ) - ( 5 - 10 ) 

tl = lt2 = l »JV=1 



The quantity in the exponent, (7^ + k 2 + ■ ■ ■ + k N ), is the length of the 
encoding of the string x = a^ct^ . . . cn N . For every string x of length N, 
there is one term in the above sum. Introduce an array Ai that counts 
how many strings x have encoded length I. Then, defining Z m ; n = min^ ij 
and l max = maxj \^\ 

^max 

S N = 2 ~ lA i- ( 5 - n ) 

I — ^V7 mm 

Now assume C is uniquely decodeable, so that for all x/y, c+(x) ^ 
c + (y). Concentrate on the x that have encoded length I. There are a 
total of 2 l distinct bit strings of length I, so it must be the case that 
Ai < 2 l . So 



=OT min l=Nl 



£ 2-%< g 1 < Nl max . (5.12) 



Thus S N < ImaxN for all N. Now if S were greater than 1, then as N 
increases, S N would be an exponentially growing function, and for large 
enough N, an exponential always exceeds a polynomial such as Ima^N. 
But our result (S N < l max N) is true for any N. Therefore S < 1. □ 

> Exercise 5.14J 3 ' P- 104 l Prove the result stated above, that for any set of code- 
word lengths {li} satisfying the Kraft inequality, there is a prefix code 
having those lengths. 
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Figure 5.1. The symbol coding 
budget. The 'cost' 2~' of each 
codeword (with length I) is 
indicated by the size of the box it 
is written in. The total budget 
available when making a uniquely 
dccodeable code is 1. 
You can think of this diagram as 
showing a codeword supermarket, 
with the codewords arranged in 
aisles by their length, and the cost 
of each codeword indicated by the 
size of its box on the shelf. If the 
cost of the codewords that you 
take exceeds the budget then your 
code will not be uniquely 
dccodeable. 



C 



C 3 




C 6 



Figure 5.2. Selections of 
codewords made by codes 
Co, C3, C4 and C% from section 
5.1. 
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5.3: What's the most compression that we can hope for? 

A pictorial view of the Kraft inequality may help you solve this exercise. 
Imagine that we are choosing the codewords to make a symbol code. We can 
draw the set of all candidate codewords in a supermarket that displays the 
'cost' of the codeword by the area of a box (figure 5.1). The total budget 
available - the '1' on the right-hand side of the Kraft inequality - is shown at 
one side. Some of the codes discussed in section 5.1 are illustrated in figure 
5.2. Notice that the codes that are prefix codes, Co, C3, and C4, have the 
property that to the right of any selected codeword, there are no other selected 
codewords - because prefix codes correspond to trees. Notice that a complete 
prefix code corresponds to a complete tree having no unused branches. 

We are now ready to put back the symbols' probabilities {pi}. Given a 
set of symbol probabilities (the English language probabilities of figure 2.1, 
for example), how do we make the best symbol code - one with the smallest 
possible expected length L(C, X)? And what is that smallest possible expected 
length? It's not obvious how to assign the codeword lengths. If we give short 
codewords to the more probable symbols then the expected length might be 
reduced; on the other hand, shortening some codewords necessarily causes 
others to lengthen, by the Kraft inequality. 

► 5.3 What's the most compression that we can hope for? 

We wish to minimize the expected length of a code, 



As you might have guessed, the entropy appears as the lower bound on the 
expected length of a code. 

Lower bound on expected length. The expected length L(C,X) of a 
uniquely decodeable code is bounded below by H(X). 

Proof. We define the implicit probabilities qi = 2~ li / z, where z= 2 _ '«', so 
that U = log l/qi — log z. We then use Gibbs' inequality, J2iPi l°g Vft ^ 
X^Pilogl/pi, with equality if qi=Pi, and the Kraft inequality z < 1: 




(5.13) 



L(C,X) 



^Pik = ^2 / Pi^ogl/q l - logz 



(5.14) 




(5.15) 



> H(X). (5.16) 



The equality L(C, X) = H(X) is achieved only if the Kraft equality z = l 
is satisfied, and if the codelengths satisfy k = log(l/pj). □ 



This is an important result so let's say it again: 



Optimal source codelengths. The expected length is minimized and is 
equal to H{X) only if the codelengths are equal to the Shannon in- 
formation contents: 



k = log 2 (l/ Pi ). (5.17) 
Implicit probabilities defined by codelengths. Conversely, any choice 



of codelengths {li} implicitly defines a probability distribution {qi}, 

qi = 2~ l */z, (5.18) 



for which those codelengths would be the optimal codelengths. If the 
code is complete then z = 1 and the implicit probabilities are given by 
q, = 2- l >. 
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98 5 — Symbol Codes 

► 5.4 How much can we compress? 

So, we can't compress below the entropy. How close can we expect to get to 
the entropy? 

Theorem 5.1 Source coding theorem for symbol codes. For an ensemble X 
there exists a prefix code C with expected length satisfying 

H(X) < L(C, X) < H(X) + 1. (5.19) 



Proof. We set the codelengths to integers slightly larger than the optimum 
lengths: 

k = riog 2 (lM)l (5.20) 

where \l*~\ denotes the smallest integer greater than or equal to I*. [We 
are not asserting that the optimal code necessarily uses these lengths, 
we are simply choosing these lengths because we can use them to prove 
the theorem.] 

We check that there is a prefix code with these lengths by confirming 
that the Kraft inequality is satisfied. 



log 2 (l/Pi) 



(5.21) 



Then we confirm 

L(C,X) = ^ Pi riog(l/ Pl )l < X>(log(l/pO + l) = H(X) + 1. (5.22) 

i i 

□ 

The cost of using the wrong codelengths 

If we use a code whose lengths are not equal to the optimal codelengths, the 
average message length will be larger than the entropy. 

If the true probabilities are {pi} and we use a complete code with lengths 
li, we can view those lengths as defining implicit probabilities qi = 2~ li . Con- 
tinuing from equation (5.14), the average length is 



L(C,X) = H(X) 



Pilogpi/qi, 



(5.23) 



i.e., it exceeds the entropy by the relative entropy Z)KL(p||q) (as defined on 
p.34). 

► 5.5 Optimal source coding with symbol codes: Huffman coding 

Given a set of probabilities V, how can we design an optimal prefix code? 
For example, what is the best symbol code for the English language ensemble 
shown in figure 5.3? When we say 'optimal', let's assume our aim is to 
minimize the expected length L(C,X). 
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0.0508 
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0.0567 
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0.0706 
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0.0334 
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0.0069 


w 


0.0119 
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0.0073 
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0.0164 




1 0.0007 




j 0.1928 



Figure 5.3. An ensemble in need of 
a symbol code. 



How not to do it 

One might try to roughly split the set Ax in two, and continue bisecting the 
subsets so as to define a binary tree from the root. This construction has the 
right spirit, as in the weighing problem, but it is not necessarily optimal; it 
achieves L(C,X) < H(X) + 2. 
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5.5: Optimal source coding with symbol codes: Huffman coding 
The Huffman coding algorithm 

We now present a beautifully simple algorithm for finding an optimal prefix 
code. The trick is to construct the code backwards starting from the tails of 
the codewords; we build the binary tree from its leaves. 



99 



1. Take the two least probable symbols in the alphabet. These two 
symbols will be given the longest codewords, which will have equal 
length, and differ only in the last digit. 

2. Combine these two symbols into a single symbol, and repeat. 



Algorithm 5.4. Huffman coding 
algorithm. 



Since each step reduces the size of the alphabet by one, this algorithm will 
have assigned strings to all the symbols after \Ax\ — 1 steps. 

Example 5.15. Let Ax = { a, b, c, d, e } 
and V x = { 0.25, 0.25, 0.2, 0.15, 0.15 }. 



step 1 step 2 step 3 step 4 
0.25 0.25- 





0.25 — 7 0.55^ 1.0 



a 

b 0.25 0.25 — 0.45 ~T 0. 45 ' 1 

c 
d 
e 

The codewords are then obtained by concatenating the binary digits in 
reverse order: C = {00,10,11,010,011}. The codelengths selected 
by the Huffman algorithm (column 4 of table 5.5) are in some cases 
longer and in some cases shorter than the ideal codelengths, the Shannon 
information contents log 2 Vpi (column 3). The expected length of the 
code is L = 2.30 bits, whereas the entropy is H = 2.2855 bits. □ 

If at any point there is more than one way of selecting the two least probable 
symbols then the choice may be made in any manner - the expected length of 
the code will not depend on the choice. 

Exercise 5.16. '-^ P- 105 l Prove that there is no better symbol code for a source 
than the Huffman code. 
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10 
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11 


d 


0.15 
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3 


010 
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0.15 


2.7 
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011 



Table 5.5. Code created by the 
Huffman algorithm. 



Example 5.17. We can make a Huffman code for the probability distribution 
over the alphabet introduced in figure 2.1. The result is shown in fig- 
ure 5.6. This code has an expected length of 4.15 bits; the entropy of 
the ensemble is 4.11 bits. Observe the disparities between the assigned 
codelengths and the ideal codelengths log 2 Vj>». 



Constructing a binary tree top-down is suboptimal 

In previous chapters we studied weighing problems in which we built ternary 
or binary trees. We noticed that balanced trees - ones in which, at every step, 
the two possible outcomes were as close as possible to equiprobable - appeared 
to describe the most efficient experiments. This gave an intuitive motivation 
for entropy as a measure of information content. 
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Figure 5.6. Huffman code for the 
English language ensemble 
(monogram statistics). 



It is not the case, however, that optimal codes can always be constructed 
by a greedy top-down method in which the alphabet is successively divided 
into subsets that are as near as possible to equiprobable. 



Example 5.18. Find the optimal binary symbol code for the ensemble: 

Ax = { a, b, c, d, e, f , g } 
V x = {0.01,0.24,0.05,0.20,0.47,0.01, 0.02} ' 



(5.24) 



Notice that a greedy top-down method can split this set into two sub- 
sets {a, b, c,d} and {e,f,g} which both have probability 1/2, and that 
{a, b, c, d} can be divided into subsets {a, b} and {c, d}, which have prob- 
ability 1/4; so a greedy top-down method gives the code shown in the 
third column of table 5.7, which has expected length 2.53. The Huffman 
coding algorithm yields the code shown in the fourth column, which has 
expected length 1.97. □ 
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000001 
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Table 5.7. A greedily-constructed 
code compared with the Huffman 
code. 



► 5.6 Disadvantages of the Huffman code 

The Huffman algorithm produces an optimal symbol code for an ensemble, 
but this is not the end of the story. Both the word 'ensemble' and the phrase 
'symbol code' need careful attention. 



Changing ensemble 

If we wish to communicate a sequence of outcomes from one unchanging en- 
semble, then a Huffman code may be convenient. But often the appropriate 
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5.6: Disadvantages of the Huffman code 

ensemble changes. If for example we are compressing text, then the symbol 
frequencies will vary with context: in English the letter u is much more prob- 
able after a q than after an e (figure 2.3). And furthermore, our knowledge of 
these context-dependent symbol frequencies will also change as we learn the 
statistical properties of the text source. 

Huffman codes do not handle changing ensemble probabilities with any 
elegance. One brute-force approach would be to recompute the Huffman code 
every time the probability over symbols changes. Another attitude is to deny 
the option of adaptation, and instead run through the entire file in advance 
and compute a good probability distribution, which will then remain fixed 
throughout transmission. The code itself must also be communicated in this 
scenario. Such a technique is not only cumbersome and restrictive, it is also 
suboptimal, since the initial message specifying the code and the document 
itself are partially redundant. This technique therefore wastes bits. 



The extra bit 

An equally serious problem with Huffman codes is the innocuous-looking 'ex- 
tra bit' relative to the ideal average length of H(X) - a Huffman code achieves 
a length that satisfies H{X) < L(C, X) < H(X) + 1, as proved in theorem 5.1. 
A Huffman code thus incurs an overhead of between and 1 bits per symbol. 
If H{X) were large, then this overhead would be an unimportant fractional 
increase. But for many applications, the entropy may be as low as one bit 
per symbol, or even smaller, so the overhead L(C,X) — H(X) may domi- 
nate the encoded file length. Consider English text: in some contexts, long 
strings of characters may be highly predictable. For example, in the context 
'strings_of _ch', one might predict the next nine symbols to be 'aracters_' 
with a probability of 0.99 each. A traditional Huffman code would be obliged 
to use at least one bit per character, making a total cost of nine bits where 
virtually no information is being conveyed (0.13 bits in total, to be precise). 
The entropy of English, given a good model, is about one bit per character 
(Shannon, 1948), so a Huffman code is likely to be highly inefficient. 

A traditional patch-up of Huffman codes uses them to compress blocks of 
symbols, for example the 'extended sources' X N we discussed in Chapter 4. 
The overhead per block is at most 1 bit so the overhead per symbol is at most 
l/N bits. For sufficiently large blocks, the problem of the extra bit may be 
removed - but only at the expenses of (a) losing the elegant instantaneous 
decodeability of simple Huffman coding; and (b) having to compute the prob- 
abilities of all relevant strings and build the associated Huffman tree. One will 
end up explicitly computing the probabilities and codes for a huge number of 
strings, most of which will never actually occur. (See exercise 5.29 (p. 103).) 



Beyond symbol codes 

Huffman codes, therefore, although widely trumpeted as 'optimal', have many 
defects for practical purposes. They are optimal symbol codes, but for practi- 
cal purposes we don't want a symbol code. 

The defects of Huffman codes are rectified by arithmetic coding, which 
dispenses with the restriction that each symbol must translate into an integer 
number of bits. Arithmetic coding is the main topic of the next chapter. 
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► 5.7 Summary 

Kraft inequality. If a code is uniquely decodeable its lengths must satisfy 

^2 _ii <l. (5.25) 

i 

For any lengths satisfying the Kraft inequality, there exists a prefix code 
with those lengths. 

Optimal source codelengths for an ensemble are equal to the Shannon 
information contents 

k = log 2 -, (5.26) 

Pi 

and conversely, any choice of codelengths defines implicit probabilities 

q l = — . (5.27) 

The relative entropy £>KL(p||q) measures how many bits per symbol are 
wasted by using a code whose implicit probabilities are q, when the 
ensemble's true probability distribution is p. 

Source coding theorem for symbol codes. For an ensemble X, there ex- 
ists a prefix code whose expected length satisfies 

H(X) < L{C, X) < H(X) + 1. (5.28) 

The Huffman coding algorithm generates an optimal symbol code itera- 
tively. At each iteration, the two least probable symbols are combined. 



► 5.8 Exercises 

> Exercise 5.19. [2] Is the code {00,11,0101,111,1010,100100,0110} uniquely 

decodeable? 

> Exercise 5.20. ^ Is the ternary code {00,012,0110,0112,100,201,212,22} 

uniquely decodeable? 

4% Exercise 5.21. [3 ' p ' 1061 Make Huffman codes for X 2 , X ? > and X 4 where A x = 
{0,1} and Vx = {0.9,0.1}. Compute their expected lengths and com- 
pare them with the entropies H{X 2 ), H{X 3 ) and H{X 4 ). 

Repeat this exercise for X 2 and X 4 where Vx = {0.6,0.4}. 

^tVj: Exercise 5.22. P- 106 l Find a probability distribution {pi,P2,P3,Pi} such that 
there are two optimal codes that assign different lengths {l{} to the four 
symbols. 

Exercise 5.23. ^ (Continuation of exercise 5.22.) Assume that the four proba- 
bilities {pi,P2,P3,P4} are ordered such that P\> P2> P3 > Pi > 0. Let 
Q be the set of all probability vectors p such that there are two optimal 
codes with different lengths. Give a complete description of Q. Find 
three probability vectors qW, q^ 2 \ q' 3 \ which are the convex hull of Q, 
i.e., such that any p G Q can be written as 

p = /iiq« + /W 2) + /i3q (3) , (5.29) 



where {/z^} are positive. 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 



5.8: Exercises 

> Exercise 5.24. ^ Write a short essay discussing how to play the game of twenty 

questions optimally. [In twenty questions, one player thinks of an object, 
and the other player has to guess the object using as few binary questions 
as possible, preferably fewer than twenty] 

> Exercise 5.25. ^ Show that, if each probability pi is equal to an integer power 

of 2 then there exists a source code whose expected length equals the 
entropy. 

> Exercise 5.26.^' p l06 l Make ensembles for which the difference between the 

entropy and the expected length of the Huffman code is as big as possible. 

> Exercise 5.27 } 2 ' P- 106 l A source X has an alphabet of eleven characters 

{a,b,c,d,e,f,g,h,i, j,k}, 
all of which have equal probability, 1/11. 

Find an optimal uniquely decodeable symbol code for this source. How 
much greater is the expected length of this optimal code than the entropy 
of XI 

> Exercise 5.28. ^ Consider the optimal symbol code for an ensemble X with 

alphabet size I from which all symbols have identical probability p = 
1/1. I is not a power of 2. 

Show that the fraction / + of the / symbols that are assigned codelengths 
equal to 

Z + =riog 2 Jl (5.30) 

satisfies 

f+ = 2 - — (5.31) 

and that the expected length of the optimal symbol code is 

L = l + -1 + /+. (5.32) 

By differentiating the excess length AL = L — H{X) with respect to I, 
show that the excess length is bounded by 

AL < 1 - - J- = 0.086. (5.33) 

In 2 In 2 V ; 

Exercise 5.29. ^ Consider a sparse binary source with Vx = {0.99, 0.01}. Dis- 
cuss how Huffman codes could be used to compress this source efficiently. 
Estimate how many codewords your proposed solutions require. 

> Exercise 5.30.^ Scientific American carried the following puzzle in 1975. 

The poisoned glass. 'Mathematicians are curious birds', the police 
commissioner said to his wife. 'You see, we had all those partly 
filled glasses lined up in rows on a table in the hotel kitchen. Only 
one contained poison, and we wanted to know which one before 
searching that glass for fingerprints. Our lab could test the liquid 
in each glass, but the tests take time and money, so we wanted to 
make as few of them as possible by simultaneously testing mixtures 
of small samples from groups of glasses. The university sent over a 
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mathematics professor to help us. He counted the glasses, smiled 
and said: 

' "Pick any glass you want, Commissioner. We'll test it first." 
' "But won't that waste a test?" I asked. 

' "No," he said, "it's part of the best procedure. We can test one 
glass first. It doesn't matter which one." ' 

'How many glasses were there to start with?' the commissioner's 
wife asked. 

'I don't remember. Somewhere between 100 and 200.' 
What was the exact number of glasses? 

Solve this puzzle and then explain why the professor was in fact wrong 
and the commissioner was right. What is in fact the optimal procedure 
for identifying the one poisoned glass? What is the expected waste 
relative to this optimum if one followed the professor's strategy? Explain 
the relationship to symbol coding. 

bxercise 5.31. [2 ' p ' 106] Assume that a sequence of symbols from the ensemble 
X introduced at the beginning of this chapter is compressed using the 
code C3. Imagine picking one bit at random from the binary encoded 
sequence c = c{x\)c(x2)c{x%) .... What is the probability that this bit 
is a 1? 



> Exercise 5.32.^' P- 10 ^ How should the binary Huffman encoding scheme be 
modified to make optimal symbol codes in an encoding alphabet with q 
symbols? (Also known as 'radix q\) 

Mixture codes 

It is a tempting idea to construct a 'metacode' from several symbol codes that 
assign different-length codewords to the alternative symbols, then switch from 
one code to another, choosing whichever assigns the shortest codeword to the 
current symbol. Clearly we cannot do this for free. If one wishes to choose 
between two codes, then it is necessary to lengthen the message in a way that 
indicates which of the two codes is being used. If we indicate this choice by 
a single leading bit, it will be found that the resulting code is sub optimal 
because it is incomplete (that is, it fails the Kraft equality). 

2 V| Exercise 5.33 p ' 108 ' Prove that this metacode is incomplete, and explain 
why this combined code is suboptimal. 
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► 5.9 Solutions 

Solution to exercise 5.8 (p.93). Yes, C2 = {1,101} is uniquely decodeable, 
even though it is not a prefix code, because no two different strings can map 
onto the same string; only the codeword 0(02) = 101 contains the symbol 0. 

Solution to exercise 5.14 (p. 95). We wish to prove that for any set of codeword 
lengths {k} satisfying the Kraft inequality, there is a prefix code having those 
lengths. This is readily proved by thinking of the codewords illustrated in 
figure 5.8 as being in a 'codeword supermarket', with size indicating cost. 
We imagine purchasing codewords one at a time, starting from the shortest 
codewords (i.e., the biggest purchases), using the budget shown at the right 
of figure 5.8. We start at one side of the codeword supermarket, say the 
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Figure 5.8. The codeword 
supermarket and the symbol 
coding budget. The 'cost' 2~ ; of 
each codeword (with length I) is 
indicated by the size of the box it 
is written in. The total budget 
available when making a uniquely 
dccodcablc code is 1. 
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top, and purchase the first codeword of the required length. We advance 
down the supermarket a distance 2~ l , and purchase the next codeword of the 
next required length, and so forth. Because the codeword lengths are getting 
longer, and the corresponding intervals are getting shorter, we can always 
buy an adjacent codeword to the latest purchase, so there is no wasting of 
the budget. Thus at the Ith codeword we have advanced a distance J2i=i 
down the supermarket; if < 1, we will have purchased all the codewords 

without running out of budget. 



Figure 5.9. Proof that Huffman 
coding makes an optimal symbol 
code. We assume that the rival 
code, which is said to be optimal, 
assigns unequal length codewords 
to the two symbols with smallest 
probability, a and b. By 
interchanging codewords a and c 
of the rival code, where c is a 
symbol with rival codelength as 
long as b's, we can make a code 
better than the rival code. This 
shows that the rival code was not 
optimal. 



Solution to exercise 5.16 (p. 99). The proof that Huffman coding is optimal 
depends on proving that the key step in the algorithm - the decision to give 
the two symbols with smallest probability equal encoded lengths - cannot 
lead to a larger expected length than any other code. We can prove this by 
contradiction. 

Assume that the two symbols with smallest probability, called a and b, 
to which the Huffman algorithm would assign equal length codewords, do not 
have equal lengths in any optimal symbol code. The optimal symbol code 
is some other rival code in which these two codewords have unequal lengths 
l a and lb with l a < If,. Without loss of generality we can assume that this 
other code is a complete prefix code, because any codelengths of a uniquely 
decodeable code can be realized by a prefix code. 

In this rival code, there must be some other symbol c whose probability 
p c is greater than p a and whose length in the rival code is greater than or 
equal to l , because the code for b must have an adjacent codeword of equal 
or greater length - a complete prefix code never has a solo codeword of the 
maximum length. 

Consider exchanging the codewords of a and c (figure 5.9), so that a is 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 



106 



5 — Symbol Codes 



encoded with the longer codeword that was c's, and c, which is more probable 
than a, gets the shorter codeword. Clearly this reduces the expected length 
of the code. The change in expected length is (p a — p c )(l c — la)- Thus we have 
contradicted the assumption that the rival code is optimal. Therefore it is 
valid to give the two symbols with smallest probability equal encoded lengths. 
Huffman coding produces optimal symbol codes. □ 

Solution to exercise 5.21 (p. 102). A Huffman code for X 2 where Ax = {0, 1} 
and V x = {0.9,0.1} is {00,01,10,11} -> {1,01,000,001}. This code has 
L(C,X 2 ) = 1.29, whereas the entropy H{X 2 ) is 0.938. 
A Huffman code for X 3 is 

{000, 100,010,001, 101,011, 110, 111} -> 

{1, 011, 010, 001, 00000, 00001, 00010, 00011}. 

This has expected length L(C, X ?> ) = 1.598 whereas the entropy H(X 3 ) is 
1.4069. 

A Huffman code for X 4 maps the sixteen source strings to the following 
codelengths: 

{0000, 1000,0100,0010,0001, 1100,0110,0011,0101, 1010, 1001, 1110, 1101, 
1011,0111, 1111} {1,3,3,3,4,6,7,7,7,7,7,9,9,9,10,10}. 

This has expected length L(C, X A ) = 1.9702 whereas the entropy H{X A ) is 
1.876. 

When Vx = {0.6,0.4}, the Huffman code for X 2 has lengths {2,2,2,2}; 
the expected length is 2 bits, and the entropy is 1.94 bits. A Huffman code for 
X 4 is shown in table 5.10. The expected length is 3.92 bits, and the entropy 
is 3.88 bits. 

Solution to exercise 5.22 (p. 102). The set of probabilities {pi,P2,Ps,P4} = 
{V6, y&, Vz, Vz} gives rise to two different optimal sets of codelengths, because 
at the second step of the Huffman coding algorithm we can choose any of the 
three possible pairings. We may either put them in a constant length code 
{00,01, 10, 11} or the code {000,001,01, 1}. Both codes have expected length 
2. 

Another solution is {pi,P2,P3,Pi} = {V5, V5, V5, %}. 
And a third is {p\,P2,Pz,Pi} = {V3, %, Vz, 0}. 

Solution to exercise 5.26 (p. 103). Let p max be the largest probability in 
pi,P2, ■ ■ ■ ,Pi- The difference between the expected length L and the entropy 
H can be no bigger than max(p max , 0.086) (Gallager, 1978). 

See exercises 5.27-5.28 to understand where the curious 0.086 comes from. 

Solution to exercise 5.27 (p. 103). Length — entropy = 0.086. 

Solution to exercise 5.31 (p. 104). There are two ways to answer this problem 
correctly, and one popular way to answer it incorrectly. Let's give the incorrect 
answer first: 

Erroneous answer. "We can pick a random bit by first picking a random 
source symbol X{ with probability p,, then picking a random bit from 
c(xi). If we define fi to be the fraction of the bits of c(xj) that are Is, 
we find 

p(bitisi) = Y,nfi ( 5 - 34 ) 

i 

= V2 X + l k X V2 + V8 X % + V8 X 1 = 1/3." (5.35) 



di 


Pi 


U 


c(a,i) 


0000 


0.1296 


3 


000 


0001 


0.0864 


4 


0100 


0010 


0.0864 


4 


0110 


0100 


0.0864 


4 


0111 


1000 


0.0864 


3 


100 


1100 


0.0576 


4 


1010 


1010 


0.0576 


4 


1100 


1001 


0.0576 


4 


1101 


0110 


0.0576 


4 


1110 


0101 


0.0576 


4 


1111 


0011 


0.0576 


4 


0010 


1110 


0.0384 


5 


00110 


1101 


0.0384 


5 


01010 


1011 


0.0384 


5 


01011 


0111 


0.0384 


4 


1011 


1111 


0.0256 


5 


00111 



Table 5.10. Huffman code for X* 
when po = 0.6. Column 3 shows 
the assigned codelengths and 
column 4 the codewords. Some 
strings whose probabilities are 
identical, e.g., the fourth and 
fifth, receive different codelengths. 



C 3 





c(Oj) 


Pi 


h 


a 





1/2 


1 


b 


10 


1/4 


2 


c 


110 


1/8 


3 


d 


111 


1/8 


3 
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5.9: Solutions 

This answer is wrong because it falls for the bus-stop fallacy, which was intro- 
duced in exercise 2.35 (p. 38): if buses arrive at random, and we are interested 
in 'the average time from one bus until the next', we must distinguish two 
possible averages: (a) the average time from a randomly chosen bus until the 
next; (b) the average time between the bus you just missed and the next bus. 
The second 'average' is twice as big as the first because, by waiting for a bus 
at a random time, you bias your selection of a bus in favour of buses that 
follow a large gap. You're unlikely to catch a bus that comes 10 seconds after 
a preceding bus! Similarly, the symbols c and d get encoded into longer-length 
binary strings than a, so when we pick a bit from the compressed string at 
random, we are more likely to land in a bit belonging to a c or a d than would 
be given by the probabilities pi in the expectation (5.34). All the probabilities 
need to be scaled up by U, and renormalized. 

Correct answer in the same style. Every time symbol xi is encoded, k 
bits are added to the binary string, of which /jZj are Is. The expected 
number of Is added per symbol is 

(5.36) 

i 

and the expected total number of bits added per symbol is 

i 

So the fraction of Is in the transmitted string is 

P(bitisl) = SiM^ (5.38) 

L-ii ViH 

V2 x + V4 x 1 + V8 x 2 + 1/8 x 3 % 1 ,„ 

For a general symbol code and a general ensemble, the expectation (5.38) is 
the correct answer. But in this case, we can use a more powerful argument. 

Information-theoretic answer. The encoded string c is the output of an 
optimal compressor that compresses samples from X down to an ex- 
pected length of H{X) bits. We can't expect to compress this data any 
further. But if the probability P(bit is 1) were not equal to 1 h then it 
would be possible to compress the binary string further (using a block 
compression code, say). Therefore P(bit is 1) must be equal to V2; in- 
deed the probability of any sequence of I bits in the compressed stream 
taking on any particular value must be 1~ l . The output of a perfect 
compressor is always perfectly random bits. 

To put it another way, if the probability P(bit is 1) were not equal to 
V2, then the information content per bit of the compressed string would 
be at most H2(P(l)), which would be less than 1; but this contradicts 
the fact that we can recover the original data from c, so the information 
content per bit of the compressed string must be H(X)/L(C,X) = 1. 

Solution to exercise 5.32 (p. 104). The general Huffman coding algorithm for 
an encoding alphabet with q symbols has one difference from the binary case. 
The process of combining q symbols into 1 symbol reduces the number of 
symbols by q— 1. So if we start with A symbols, we'll only end up with a 



(5.37) 
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108 5 — Symbol Codes 

complete g-ary tree if Amod(q— 1) is equal to 1. Otherwise, we know that 
whatever prefix code we make, it must be an incomplete tree with a number 
of missing leaves equal, modulo (q — 1), to i4mod(g- 1) — 1. For example, if 
a ternary tree is built for eight symbols, then there will unavoidably be one 
missing leaf in the tree. 

The optimal g-ary code is made by putting these extra leaves in the longest 
branch of the tree. This can be achieved by adding the appropriate number 
of symbols to the original source symbol set, all of these extra symbols having 
probability zero. The total number of leaves is then equal to r(q—l) + 1, for 
some integer r. The symbols are then repeatedly combined by taking the q 
symbols with smallest probability and replacing them by a single symbol, as 
in the binary Huffman coding algorithm. 

Solution to exercise 5.33 (p. 104). We wish to show that a greedy metacode, 
which picks the code which gives the shortest encoding, is actually suboptimal, 
because it violates the Kraft inequality. 

We'll assume that each symbol x is assigned lengths Zfe(x) by each of the 
candidate codes Ck- Let us assume there are K alternative codes and that we 
can encode which code is being used with a header of length log K bits. Then 
the metacode assigns lengths l'(x) that are given by 

l'(x) = log 2 K + mml k (x). (5.39) 

fc 

We compute the Kraft sum: 

5 — y^2~ f '^ = — y^2~ minfc ' fc ( a: ). (5.40) 

X X 

Let's divide the set Ax into non-overlapping subsets {Ak)k=i sucn that subset 
Ak contains all the symbols x that the metacode sends via code k. Then 

s^EE 2 ~ lk[x) - ( 5 - 41 ) 

k xEAk 

Now if one sub-code k satisfies the Kraft equality J2xeA x ^~ = L then it 
must be the case that 

2~ lk(x) < 1, (5.42) 

xEA k 

with equality only if all the symbols x are in Ak, which would mean that we 
are only using one of the K codes. So 

1 K 

k=i 

with equality only if equation (5.42) is an equality for all codes k. But it's 
impossible for all the symbols to be in all the non-overlapping subsets {Ak}% = i, 
so we can't have equality (5.42) holding for all k. So S < 1. 

Another way of seeing that a mixture code is suboptimal is to consider 
the binary tree that it defines. Think of the special case of two codes. The 
first bit we send identifies which code we are using. Now, in a complete code, 
any subsequent binary string is a valid string. But once we know that we 
are using, say, code A, we know that what follows can only be a codeword 
corresponding to a symbol x whose encoding is shorter under code A than 
code B. So some strings are invalid continuations, and the mixture code is 
incomplete and suboptimal. 

For further discussion of this issue and its relationship to probabilistic 
modelling read about 'bits back coding' in section 28.3 and in Frey (1998). 
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About Chapter 6 



Before reading Chapter 6, you should have read the previous chapter and 
worked on most of the exercises in it. 

We'll also make use of some Bayesian modelling ideas that arrived in the 
vicinity of exercise 2.8 (p. 30). 
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6 



Stream Codes 



In this chapter we discuss two data compression schemes. 

Arithmetic coding is a beautiful method that goes hand in hand with the 
philosophy that compression of data from a source entails probabilistic mod- 
elling of that source. As of 1999, the best compression methods for text files 
use arithmetic coding, and several state-of-the-art image compression systems 
use it too. 

Lempel-Ziv coding is a 'universal' method, designed under the philosophy 
that we would like a single compression algorithm that will do a reasonable job 
for any source. In fact, for many real life sources, this algorithm's universal 
properties hold only in the limit of unfeasibly large amounts of data, but, all 
the same, Lempel-Ziv compression is widely used and often effective. 

► 6.1 The guessing game 

As a motivation for these two compression methods, consider the redundancy 
in a typical English text file. Such files have redundancy at several levels: for 
example, they contain the ASCII characters with non-equal frequency; certain 
consecutive pairs of letters are more probable than others; and entire words 
can be predicted given the context and a semantic understanding of the text. 

To illustrate the redundancy of English, and a curious way in which it 
could be compressed, we can imagine a guessing game in which an English 
speaker repeatedly attempts to predict the next character in a text file. 

For simplicity, let us assume that the allowed alphabet consists of the 26 
upper case letters A,B,C, . . . , Z and a space '-'. The game involves asking 
the subject to guess the next character repeatedly, the only feedback being 
whether the guess is correct or not, until the character is correctly guessed. 
After a correct guess, we note the number of guesses that were made when 
the character was identified, and ask the subject to guess the next character 
in the same way. 

One sentence gave the following result when a human was asked to guess 
a sentence. The numbers of guesses are listed below each character. 

THERE-IS-N0-REVERSE-0N-A-M0T0RCYCLE- 

111511211211 15 1 17 111213212271111411111 

Notice that in many cases, the next letter is guessed immediately, in one 
guess. In other cases, particularly at the start of syllables, more guesses are 
needed. 

What do this game and these results offer us? First, they demonstrate the 
redundancy of English from the point of view of an English speaker. Second, 
this game might be used in a data compression scheme, as follows. 
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6.2: Arithmetic codes 

The string of numbers '1, 1, 1, 5, 1, listed above, was obtained by 

presenting the text to the subject. The maximum number of guesses that the 
subject will make for a given letter is twenty-seven, so what the subject is 
doing for us is performing a time- varying mapping of the twenty-seven letters 
{A, B, C, . . . , Z, — } onto the twenty-seven numbers {1, 2, 3, . . . , 27}, which we 
can view as symbols in a new alphabet. The total number of symbols has not 
been reduced, but since he uses some of these symbols much more frequently 
than others - for example, 1 and 2 - it should be easy to compress this new 
string of symbols. 

How would the uncompression of the sequence of numbers '1, 1, 1,5, 1, 
work? At uncompression time, we do not have the original string 'THERE. . . ', 
we have only the encoded sequence. Imagine that our subject has an absolutely 
identical twin who also plays the guessing game with us, as if we knew the 
source text. If we stop him whenever he has made a number of guesses equal to 
the given number, then he will have just guessed the correct letter, and we can 
then say 'yes, that's right', and move to the next character. Alternatively, if 
the identical twin is not available, we could design a compression system with 
the help of just one human as follows. We choose a window length L, that is, 
a number of characters of context to show the human. For every one of the 
27 L possible strings of length L, we ask them, 'What would you predict is the 
next character?', and 'If that prediction were wrong, what would your next 
guesses be?'. After tabulating their answers to these 26 x 27 L questions, we 
could use two copies of these enormous tables at the encoder and the decoder 
in place of the two human twins. Such a language model is called an Lth order 
Markov model. 

These systems are clearly unrealistic for practical compression, but they 
illustrate several principles that we will make use of now. 

► 6.2 Arithmetic codes 

When we discussed variable-length symbol codes, and the optimal Huffman 
algorithm for constructing them, we concluded by pointing out two practical 
and theoretical problems with Huffman codes (section 5.6). 

These defects are rectified by arithmetic codes, which were invented by 
Elias, by Rissanen and by Pasco, and subsequently made practical by Witten 
et al. (1987). In an arithmetic code, the probabilistic modelling is clearly 
separated from the encoding operation. The system is rather similar to the 
guessing game. The human predictor is replaced by a probabilistic model of 
the source. As each symbol is produced by the source, the probabilistic model 
supplies a predictive distribution over all possible values of the next symbol, 
that is, a list of positive numbers {pi} that sum to one. If we choose to model 
the source as producing i.i.d. symbols with some known distribution, then the 
predictive distribution is the same every time; but arithmetic coding can with 
equal ease handle complex adaptive models that produce context-dependent 
predictive distributions. The predictive model is usually implemented in a 
computer program. 

The encoder makes use of the model's predictions to create a binary string. 
The decoder makes use of an identical twin of the model (just as in the guessing 
game) to interpret the binary string. 

Let the source alphabet be Ax = • • • , and let the 7th symbol aj 
have the special meaning 'end of transmission'. The source spits out a sequence 
x\,X2, • • • , x n , .... The source does not necessarily produce i.i.d. symbols. We 
will assume that a computer program is provided to the encoder that assigns a 
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predictive probability distribution over a, given the sequence that has occurred 
thus far, P(x n = cii\xi, . . . , x n -\). The receiver has an identical program that 
produces the same predictive probability distribution P(x n = di \ x±, . . . , x„_i). 



0.00 
0.25 
0.50 
0.75 
1.00 



01 



01101 



Figure 6.1. Binary strings define 
real intervals within the real line 
[0,1). We first encountered a 
picture like this when we 
discussed the symbol-code 
supermarket in Chapter 5. 



Concepts for understanding arithmetic coding 

Notation for intervals. The interval [0.01,0.10) is all numbers between 0.01 and 
0.10, including 0.010 = 0.01000 . . . but not 0.100 = 0.10000 .... 

A binary transmission defines an interval within the real line from to 1. 
For example, the string 01 is interpreted as a binary real number 0.01. . . , which 
corresponds to the interval [0.01,0.10) in binary, i.e., the interval [0.25,0.50) 
in base ten. 

The longer string 01101 corresponds to a smaller interval [0.01101, 
0.01110). Because 01101 has the first string, 01, as a prefix, the new in- 
terval is a sub-interval of the interval [0.01,0.10). A one-megabyte binary file 
(2 23 bits) is thus viewed as specifying a number between and 1 to a precision 
of about two million decimal places - two million decimal digits, because each 
byte translates into a little more than two decimal digits. 

Now, we can also divide the real line [0,1) into I intervals of lengths equal 
to the probabilities P{x\ = a^), as shown in figure 6.2. 

0.00 

P(x 1 = a 1 ) — - 



\ ai 



P{xi = ai) + P(xi = a 2 ) 



a 2 



a 2 ai 



a 2 a 5 



Figure 6.2. A probabilistic model 
defines real intervals within the 
real line [0,1). 



P{x 1 = a i ) 



P(x 1 =a I - i ) 
1.0 



\ a/ 



We may then take each interval ai and subdivide it into intervals de- 
noted (ZjOi, a^a 2 , . . . , ajar, such that the length of a^aj is proportional to 
P(x 2 = aj \ x\ = ai). Indeed the length of the interval a%aj will be precisely 
the joint probability 



P(xi = ai,x 2 = aj) = P(x\ = ai)P(x 2 = aj \ x\=ai). 



(6.1) 



Iterating this procedure, the interval [0, 1) can be divided into a sequence 
of intervals corresponding to all possible finite length strings x\x 2 ■ ■ ■ xjy, such 
that the length of an interval is equal to the probability of the string given 
our model. 
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u : = 


. 


v : = 1 


.0 


p := v 


— u 


for n 


= 1 to N { 




Compute the cumulative probabilities Q n and R n (6.2, 6.3) 




v := u + pR n {x n \xx,...,x n -{) 




u := u+pQ n (x n \xi,...,x n -i) 




p : = v — u 


> 





Algorithm 6.3. Arithmetic coding. 
Iterative procedure to find the 
interval [it, v) for the string 



Formulae describing arithmetic coding 

The process depicted in figure 6.2 can be written explicitly as follows. The 
intervals are defined in terms of the lower and upper cumulative probabilities 

i-i 

Q n (a i \x 1 ,...,x n - 1 ) = ^ P(x n = a i >\xi,...,x n - 1 ), (6.2) 
v = 1 

i 

R n (a i \x 1 ,...,x n - 1 ) = ^2 P(x n = Oi> \x 1 ,...,x n - 1 ). (6.3) 
v = 1 

As the nth symbol arrives, we subdivide the n— 1th interval at the points defined 
by Q n and R n . For example, starting with the first symbol, the intervals 'oi', 
'a 2 ', and 'a/' are 

oi <-> [Qi(oi),iZi(oi)) = [0,P(zi=ai)), (6.4) 
a 2 <-> [Qi(a 2 ),i?i(a 2 )) = [P(x = a l ),P{x = a 1 ) + P{x = a 2 )) , (6.5) 

and 

oj « [Qi(o/),iIi(aj)) = [P(xi = ai) + ... + P(xi = aj_i),1.0). (6.6) 
Algorithm 6.3 describes the general procedure. 

To encode a string a;ia;2 • • • xjy, we locate the interval corresponding to 
x\X2 ■ ■ - xn, and send a binary string whose interval lies within that interval. 
This encoding can be performed on the fly, as we now illustrate. 

Example: compressing the tosses of a bent coin 

Imagine that we watch as a bent coin is tossed some number of times (cf. 
example 2.7 (p. 30) and section 3.2 (p. 51)). The two outcomes when the coin 
is tossed are denoted a and b. A third possibility is that the experiment is 
halted, an event denoted by the 'end of file' symbol, Because the coin is 
bent, we expect that the probabilities of the outcomes a and b are not equal, 
though beforehand we don't know which is the more probable outcome. 

Encoding 



Let the source string be 'bbbaD'. We pass along the string one symbol at a 
time and use our model to compute the probability distribution of the next 
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symbol given the string thus far. Let these probabilities be: 



Context 

(sequence thus far) 



Probability of next symbol 





P(a) 


= 0.425 


P(b) 


= 0.425 


P(D) 


= 0.15 


b 


P(a|b) 


= 0.28 


P(b|b) 


= 0.57 


P(D|b) 


= 0.15 


bb 


P(a|bb) 


= 0.21 


P(b | bb) 


= 0.64 


P(D | bb) 


= 0.15 


bbb 


P(a | bbb) 


= 0.17 


P(b | bbb) 


= 0.68 


P(D|bbb) 


= 0.15 


bbba 


P(a|bbba) 


= 0.28 


P(b | bbba) 


= 0.57 


P(n |bbba) 


= 0.15 



Figure 6.4 shows the corresponding intervals. The interval b is the middle 
0.425 of [0, 1). The interval bb is the middle 0.567 of b, and so forth. 



ba 



bba 



bbba 



bb bbb bbbb 
bbbD 



bbD 



bD 



00000 



00001 



0000 



00010 



000 



00011 



0001 



00100 



00 



00101 



0010 



00110 



001 



00111 



0011 



01000 



01001 



0100 



01010 



010 



01011 



0101 



01100 



01 



01101 



0110 



OHIO 



011 



01111 



0111 



10000 



10001 



1000 



10010 



100 



10011 



1001 



10100 



10 



10101 



1010 



10110 



101 



10111 



1011 



11000 



11001 
1101 
11011 
1110 
11101 
1111 

11111 



1100 



110 



1101 



11 



1110 



111 



1111 



bbbaa 



bbba 



bbbab 



bbbaD 



10010111 



1001100 
10011001 
1001101 
10011011 
1001110 
10011101 
1001111 
, 10011111 



\10100000 



100111101 



Figure 6.4. Illustration of the 
arithmetic coding process as the 
sequence bbbaD is transmitted. 



10011 



When the first symbol 'b' is observed, the encoder knows that the encoded 
string will start '01', '10', or '11', but does not know which. The encoder 
writes nothing for the time being, and examines the next symbol, which is 'b'. 
The interval 'bb' lies wholly within interval '1', so the encoder can write the 
first bit: '1'. The third symbol 'b' narrows down the interval a little, but not 
quite enough for it to lie wholly within interval '10'. Only when the next 'a' 
is read from the source can we transmit some more bits. Interval 'bbba' lies 
wholly within the interval '1001', so the encoder adds '001' to the '1' it has 
written. Finally when the '□' arrives, we need a procedure for terminating the 
encoding. Magnifying the interval 'bbbaD' (figure 6.4, right) we note that the 
marked interval '100111101' is wholly contained by bbbaD, so the encoding 
can be completed by appending '11101'. 
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6.2: Arithmetic codes 

Exercise 6.1.^' p ' 12 ^ Show that the overhead required to terminate a message 
is never more than 2 bits, relative to the ideal message length given the 
probabilistic model 7i, /i(x | TL) = log[l/P(x | TL)\. 

This is an important result. Arithmetic coding is very nearly optimal. The 
message length is always within two bits of the Shannon information content 
of the entire source string, so the expected message length is within two bits 
of the entropy of the entire message. 

Decoding 

The decoder receives the string '100111101' and passes along it one symbol 
at a time. First, the probabilities P(a), P(b), P(D) are computed using the 
identical program that the encoder used and the intervals 'a', 'b' and '□' are 
deduced. Once the first two bits '10' have been examined, it is certain that 
the original string must have been started with a 'b', since the interval '10' lies 
wholly within interval 'b'. The decoder can then use the model to compute 
P(a | b), P(b | b), P(n | b) and deduce the boundaries of the intervals 'ba', 'bb' 
and 'bD'. Continuing, we decode the second b once we reach '1001', the third 
b once we reach '100111', and so forth, with the unambiguous identification 
of 'bbbaD' once the whole binary string has been read. With the convention 
that '□' denotes the end of the message, the decoder knows to stop decoding. 

Transmission of multiple files 

How might one use arithmetic coding to communicate several distinct files over 
the binary channel? Once the □ character has been transmitted, we imagine 
that the decoder is reset into its initial state. There is no transfer of the learnt 
statistics of the first file to the second file. If, however, we did believe that 
there is a relationship among the files that we are going to compress, we could 
define our alphabet differently, introducing a second end-of-file character that 
marks the end of the file but instructs the encoder and decoder to continue 
using the same probabilistic model. 

The big picture 

Notice that to communicate a string of N letters both the encoder and the 
decoder needed to compute only N\A\ conditional probabilities - the proba- 
bilities of each possible letter in each context actually encountered - just as in 
the guessing game. This cost can be contrasted with the alternative of using 
a Huffman code with a large block size (in order to reduce the possible one- 
bit-per-symbol overhead discussed in section 5.6), where all block sequences 
that could occur must be considered and their probabilities evaluated. 

Notice how flexible arithmetic coding is: it can be used with any source 
alphabet and any encoded alphabet. The size of the source alphabet and the 
encoded alphabet can change with time. Arithmetic coding can be used with 
any probability distribution, which can change utterly from context to context. 

Furthermore, if we would like the symbols of the encoding alphabet (say, 
and 1) to be used with unequal frequency, that can easily be arranged by 
subdividing the right-hand interval in proportion to the required frequencies. 

How the probabilistic model might make its predictions 

The technique of arithmetic coding does not force one to produce the predic- 
tive probability in any particular way, but the predictive distributions might 
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Figure 6.5. Illustration of the 
intervals defined by a simple 
Baycsian probabilistic model. The 
size of an intervals is proportional 
to the probability of the string. 
This model anticipates that the 
source is likely to be biased 
towards one of a and b, so 
sequences having lots of as or lots 
of bs have larger intervals than 
sequences of the same length that 
arc 50:50 as and bs. 



naturally be produced by a Bayesian model. 

Figure 6.4 was generated using a simple model that always assigns a prob- 
ability of 0.15 to □, and assigns the remaining 0.85 to a and b, divided in 
proportion to probabilities given by Laplace's rule, 

p _|_ i 

P L (a.\x 1 ,...,x n - 1 )= a (6.7) 

fa + i*b + i 

where F a (x\, . . . , x n -\) is the number of times that a has occurred so far, and 
F b is the count of bs. These predictions corresponds to a simple Bayesian 
model that expects and adapts to a non-equal frequency of use of the source 
symbols a and b within a file. 

Figure 6.5 displays the intervals corresponding to a number of strings of 
length up to five. Note that if the string so far has contained a large number of 
bs then the probability of b relative to a is increased, and conversely if many 
as occur then as are made more probable. Larger intervals, remember, require 
fewer bits to encode. 



Details of the Bayesian model 



Having emphasized that any model could be used - arithmetic coding is not 
wedded to any particular set of probabilities - let me explain the simple adaptive 
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6.2: Arithmetic codes 

probabilistic model used in the preceding example; we first encountered this 
model in exercise 2.8 (p. 30). 

Assumptions 

The model will be described using parameters pa, p a and Pb, defined below, 
which should not be confused with the predictive probabilities in a particular 
context, for example, P(a | s = baa). A bent coin labelled a and b is tossed some 
number of times I, which we don't know beforehand. The coin's probability of 
coming up a when tossed is p a , and Pb = 1 — p a ; the parameters p a ,Pb are not 
known beforehand. The source string s = baabaQ indicates that I was 5 and 
the sequence of outcomes was baaba. 

1 . It is assumed that the length of the string / has an exponential probability 
distribution 

P(l) = (1 - PuYpn- (6.8) 

This distribution corresponds to assuming a constant probability p u for 
the termination symbol '□' at each character. 

2. It is assumed that the non- terminal characters in the string are selected in- 
dependently at random from an ensemble with probabilities V = {p a ,Pb}; 
the probability p a is fixed throughout the string to some unknown value 
that could be anywhere between and 1 . The probability of an a occur- 
ring as the next symbol, given p a (if only we knew it), is (1 — Pu)p a . The 
probability, given p a , that an unterminated string of length F is a given 
string s that contains {F a , F b } counts of the two outcomes is the Bernoulli 
distribution 

P(s|p a ,F)=pf*(l- Pa ) F \ (6.9) 

3. We assume a uniform prior distribution for p a , 

P(p a ) = l, p a e[0,l], (6.10) 

and define p b = 1 — p a . It would be easy to assume other priors on p a , 
with beta distributions being the most convenient to handle. 

This model was studied in section 3.2. The key result we require is the predictive 
distribution for the next symbol, given the string so far, s. This probability 
that the next character is a or b (assuming that it is not '□') was derived in 
equation (3.16) and is precisely Laplace's rule (6.7). 

> Exercise 6.2. ^ Compare the expected message length when an ASCII file is 
compressed by the following three methods. 

Huffman-with-header. Read the whole file, find the empirical fre- 
quency of each symbol, construct a Huffman code for those frequen- 
cies, transmit the code by transmitting the lengths of the Huffman 
codewords, then transmit the file using the Huffman code. (The 
actual codewords don't need to be transmitted, since we can use a 
deterministic method for building the tree given the codelengths.) 

Arithmetic code using the Laplace model. 

F a + 1 

P L (a.\x 1 ,...,x n _ 1 ) = — + ( 6 - n ) 

Arithmetic code using a Dirichlet model. This model's predic- 
tions are: 

F. + a . 
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where a is fixed to a number such as 0.01. A small value of a 
corresponds to a more responsive version of the Laplace model; 
the probability over characters is expected to be more nonuniform; 
a = 1 reproduces the Laplace model. 

Take care that the header of your Huffman message is self-delimiting. 
Special cases worth considering are (a) short files with just a few hundred 
characters; (b) large files in which some characters are never used. 



► 6.3 Further applications of arithmetic coding 

Efficient generation of random samples 

Arithmetic coding not only offers a way to compress strings believed to come 
from a given model; it also offers a way to generate random strings from a 
model. Imagine sticking a pin into the unit interval at random, that line 
having been divided into subintervals in proportion to probabilities pi\ the 
probability that your pin will lie in interval i is pi. 

So to generate a sample from a model, all we need to do is feed ordinary 
random bits into an arithmetic decoder for that model. An infinite random 
bit sequence corresponds to the selection of a point at random from the line 
[0,1), so the decoder will then select a string at random from the assumed 
distribution. This arithmetic method is guaranteed to use very nearly the 
smallest number of random bits possible to make the selection - an important 
point in communities where random numbers are expensive! [This is not a joke. 
Large amounts of money are spent on generating random bits in software and 
hardware. Random numbers are valuable.] 

A simple example of the use of this technique is in the generation of random 
bits with a nonuniform distribution {po,Pi}. 



Exercise 6.3.'- ' p ' ' Compare the following two techniques for generating 



random symbols from a nonuniform distribution {po,Pi} = {0.99,0.01}: 

(a) The standard method: use a standard random number generator 
to generate an integer between 1 and 2 32 . Rescale the integer to 
(0, 1). Test whether this uniformly distributed random variable is 
less than 0.99, and emit a or 1 accordingly. 

(b) Arithmetic coding using the correct model, fed with standard ran- 
dom bits. 

Roughly how many random bits will each method use to generate a 
thousand samples from this sparse distribution? 

Efficient data-entry devices 

When we enter text into a computer, we make gestures of some sort - maybe 
we tap a keyboard, or scribble with a pointer, or click with a mouse; an 
efficient text entry system is one where the number of gestures required to 
enter a given text string is small. 

Writing can be viewed as an inverse process to data compression. In data 
compression, the aim is to map a given text string into a small number of bits. 
In text entry, we want a small sequence of gestures to produce our intended 
text. 

By inverting an arithmetic coder, we can obtain an information-efficient 
text entry device that is driven by continuous pointing gestures (Ward et al, 
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Compression: 
text — » bits 

Writing: 

text <— gestures 
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6.4: Lempel-Ziv coding 

2000). In this system, called Dasher, the user zooms in on the unit interval to 
locate the interval corresponding to their intended string, in the same style as 
figure 6.4. A language model (exactly as used in text compression) controls 
the sizes of the intervals such that probable strings are quick and easy to 
identify. After an hour's practice, a novice user can write with one finger 
driving Dasher at about 25 words per minute - that's about half their normal 
ten-finger typing speed on a regular keyboard. It's even possible to write at 25 
words per minute, hands-free, using gaze direction to drive Dasher (Ward and 
MacKay, 2002). Dasher is available as free software for various platforms. 1 

► 6.4 Lempel-Ziv coding 

The Lempel-Ziv algorithms, which are widely used for data compression (e.g., 
the compress and gzip commands), are different in philosophy to arithmetic 
coding. There is no separation between modelling and coding, and no oppor- 
tunity for explicit modelling. 

Basic Lempel-Ziv algorithm 

The method of compression is to replace a substring with a pointer to 
an earlier occurrence of the same substring. For example if the string is 
1011010100010.. . , we parse it into an ordered dictionary of substrings that 
have not appeared before as follows: A, 1, 0, 11, 01, 010, 00, 10, .... We in- 
clude the empty substring A as the first substring in the dictionary and order 
the substrings in the dictionary by the order in which they emerged from the 
source. After every comma, we look along the next part of the input sequence 
until we have read a substring that has not been marked off before. A mo- 
ment's reflection will confirm that this substring is longer by one bit than a 
substring that has occurred earlier in the dictionary. This means that we can 
encode each substring by giving a pointer to the earlier occurrence of that pre- 
fix and then sending the extra bit by which the new substring in the dictionary 
differs from the earlier substring. If, at the nth bit, we have enumerated s(n) 
substrings, then we can give the value of the pointer in [log 2 s(n)~\ bits. The 
code for the above sequence is then as shown in the fourth line of the following 
table (with punctuation included for clarity), the upper lines indicating the 
source string and the value of s(n): 

source substrings A 1 11 01 010 00 10 

s(n) 1 2 3 4 5 6 7 

sHbinary 000 001 010 011 100 101 110 111 

(pointer, bit) (, 1) (0,0) (01,1) (10,1) (100,0) (010,0) (001,0) 

Notice that the first pointer we send is empty, because, given that there is 
only one substring in the dictionary - the string A - no bits are needed to 
convey the 'choice' of that substring as the prefix. The encoded string is 
100011101100001000010. The encoding, in this simple case, is actually a 
longer string than the source string, because there was no obvious redundancy 
in the source string. 

l> Exercise 6.4. ^ Prove that any uniquely decodeable code from {0, l}+ to 
{0, l} + necessarily makes some strings longer if it makes some strings 
shorter. 



x http : //www . inference . phy . cam . ac . uk/dasher/ 
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120 6 — Stream Codes 

One reason why the algorithm described above lengthens a lot of strings is 
because it is inefficient - it transmits unnecessary bits; to put it another way, 
its code is not complete. Once a substring in the dictionary has been joined 
there by both of its children, then we can be sure that it will not be needed 
(except possibly as part of our protocol for terminating a message); so at that 
point we could drop it from our dictionary of substrings and shuffle them 
all along one, thereby reducing the length of subsequent pointer messages. 
Equivalently, we could write the second prefix into the dictionary at the point 
previously occupied by the parent. A second unnecessary overhead is the 
transmission of the new bit in these cases - the second time a prefix is used, 
we can be sure of the identity of the next bit. 

Decoding 

The decoder again involves an identical twin at the decoding end who con- 
structs the dictionary of substrings as the data are decoded. 

O Exercise 6.5. ^' p ' 128 ^ Encode the string 000000000000100000000000 using 
the basic Lempel-Ziv algorithm described above. 

> Exercise 6.6. ^' p l28 l Decode the string 

00101011101100100100011010101000011 

that was encoded using the basic Lempel-Ziv algorithm. 

Practicalities 

In this description I have not discussed the method for terminating a string. 

There are many variations on the Lempel-Ziv algorithm, all exploiting the 
same idea but using different procedures for dictionary management, etc. The 
resulting programs are fast, but their performance on compression of English 
text, although useful, does not match the standards set in the arithmetic 
coding literature. 

Theoretical properties 

In contrast to the block code, Huffman code, and arithmetic coding methods 
we discussed in the last three chapters, the Lempel-Ziv algorithm is defined 
without making any mention of a probabilistic model for the source. Yet, given 
any ergodic source (i.e., one that is memoryless on sufficiently long timescales), 
the Lempel-Ziv algorithm can be proven asymptotically to compress down to 
the entropy of the source. This is why it is called a 'universal' compression 
algorithm. For a proof of this property, see Cover and Thomas (1991). 

It achieves its compression, however, only by memorizing substrings that 
have happened so that it has a short name for them the next time they occur. 
The asymptotic timescale on which this universal performance is achieved may, 
for many sources, be unfeasibly long, because the number of typical substrings 
that need memorizing may be enormous. The useful performance of the al- 
gorithm in practice is a reflection of the fact that many files contain multiple 
repetitions of particular short sequences of characters, a form of redundancy 
to which the algorithm is well suited. 
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6.5: Demonstration 
Common ground 

I have emphasized the difference in philosophy behind arithmetic coding and 
Lempel-Ziv coding. There is common ground between them, though: in prin- 
ciple, one can design adaptive probabilistic models, and thence arithmetic 
codes, that are 'universal', that is, models that will asymptotically compress 
any source in some class to within some factor (preferably 1) of its entropy. 
However, for practical purposes, I think such universal models can only be 
constructed if the class of sources is severely restricted. A general purpose 
compressor that can discover the probability distribution of any source would 
be a general purpose artificial intelligence! A general purpose artificial intelli- 
gence does not yet exist. 

► 6.5 Demonstration 

An interactive aid for exploring arithmetic coding, dasher. tcl, is available. 2 
A demonstration arithmetic-coding software package written by Radford 
Neal 3 consists of encoding and decoding modules to which the user adds a 
module defining the probabilistic model. It should be emphasized that there 
is no single general-purpose arithmetic-coding compressor; a new model has to 
be written for each type of source. Radford Neal's package includes a simple 
adaptive model similar to the Bayesian model demonstrated in section 6.2. 
The results using this Laplace model should be viewed as a basic benchmark 
since it is the simplest possible probabilistic model - it simply assumes the 
characters in the file come independently from a fixed ensemble. The counts 
of the symbols {a^} are rescaled and rounded as the file is read such that 
all the counts lie between 1 and 256. 

A state-of-the-art compressor for documents containing text and images, 
DjVu, uses arithmetic coding. 4 It uses a carefully designed approximate arith- 
metic coder for binary alphabets called the Z-coder (Bottou et ah, 1998), which 
is much faster than the arithmetic coding software described above. One of 
the neat tricks the Z-coder uses is this: the adaptive model adapts only occa- 
sionally (to save on computer time), with the decision about when to adapt 
being pseudo-randomly controlled by whether the arithmetic encoder emitted 
a bit. 

The JBIG image compression standard for binary images uses arithmetic 
coding with a context-dependent model, which adapts using a rule similar to 
Laplace's rule. PPM (Teahan, 1995) is a leading method for text compression, 
and it uses arithmetic coding. 

There are many Lempel-Ziv-based programs, gzip is based on a version 
of Lempel-Ziv called 'LZ77' (Ziv and Lempel, 1977). compress is based on 
'LZW (Welch, 1984). In my experience the best is gzip, with compress being 
inferior on most files. 

bzip is a block-sorting file compressor, which makes use of a neat hack 
called the Burrows-Wheeler transform (Burrows and Wheeler, 1994). This 
method is not based on an explicit probabilistic model, and it only works well 
for files larger than several thousand characters; but in practice it is a very 
effective compressor for files in which the context of a character is a good 
predictor for that character. 5 

2 http : //www. inference .phy . cam. ac . uk/mackay/itprnn/ sof twarel .html 
3 f tp : //ftp . cs . toronto . edu/pub/radf ord/www/ac . sof tware .html 
4 http : //www . dj vuzone . org/ 

5 There is a lot of information about the Burrows-Wheeler transform on the net. 
http://dogma.net/DataCompression/BWT. shtml 
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Compression of a text file 

Table 6.6 gives the computer time in seconds taken and the compression 
achieved when these programs are applied to the IATgX file containing the 
text of this chapter, of size 20,942 bytes. 



Method 


Compression 


Compressed size 


Uncompression 




time / sec 


(%age of 20,942) 


time / sec 


Laplace model 


0.28 


12 974 (61%) 


0.32 


gzip 


0.10 


8 177 (39%) 


0.01 


compress 


0.05 


10 816 (51%) 


0.05 


bzip 




7495 (36%) 




bzip2 




7640 (36%) 




ppmz 




6 800 (32%) 





Table 6.6. Comparison of 
compression algorithms applied to 
a text file. 



Compression of a sparse file 

Interestingly, gzip does not always do so well. Table 6.7 gives the compres- 
sion achieved when these programs are applied to a text file containing 10 6 
characters, each of which is either and 1 with probabilities 0.99 and 0.01. 
The Laplace model is quite well matched to this source, and the benchmark 
arithmetic coder gives good performance, followed closely by compress; gzip 
is worst. An ideal model for this source would compress the file into about 
10 6 if 2 (0.01)/8 ~ 10100 bytes. The Laplace model compressor falls short of 
this performance because it is implemented using only eight-bit precision. The 
ppmz compressor compresses the best of all, but takes much more computer 
time. 



Method 


Compression 


Compressed size 


Uncompression 




time / sec 


/ bytes 


time / sec 


Laplace model 


0.45 


14143 (1.4%) 


0.57 


gzip 


0.22 


20 646 (2.1%) 


0.04 


gzip — best+ 


1.63 


15 553 (1.6%) 


0.05 


compress 


0.13 


14 785 (1.5%) 


0.03 


bzip 


0.30 


10 903 (1.09%) 


0.17 


bzip2 


0.19 


11260 (1.12%) 


0.05 


ppmz 


533 


10447 (1.04%) 


535 



Table 6.7. Comparison of 
compression algorithms applied to 
a random file of 10 6 characters, 
99% 0s and 1% Is. 



► 6.6 Summary 

In the last three chapters we have studied three classes of data compression 
codes. 

Fixed- length block codes (Chapter 4). These are mappings from a fixed 
number of source symbols to a fixed-length binary message. Only a tiny 
fraction of the source strings are given an encoding. These codes were 
fun for identifying the entropy as the measure of compressibility but they 
are of little practical use. 
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6. 7: Exercises on stream codes 

Symbol codes (Chapter 5). Symbol codes employ a variable-length code for 
each symbol in the source alphabet, the codelengths being integer lengths 
determined by the probabilities of the symbols. Huffman's algorithm 
constructs an optimal symbol code for a given set of symbol probabilities. 

Every source string has a uniquely decodeable encoding, and if the source 
symbols come from the assumed distribution then the symbol code will 
compress to an expected length L lying in the interval [H,H+1). Sta- 
tistical fluctuations in the source may make the actual length longer or 
shorter than this mean length. 

If the source is not well matched to the assumed distribution then the 
mean length is increased by the relative entropy Dkl between the source 
distribution and the code's implicit distribution. For sources with small 
entropy, the symbol has to emit at least one bit per source symbol; 
compression below one bit per source symbol can only be achieved by 
the cumbersome procedure of putting the source data into blocks. 

Stream codes. The distinctive property of stream codes, compared with 
symbol codes, is that they are not constrained to emit at least one bit for 
every symbol read from the source stream. So large numbers of source 
symbols may be coded into a smaller number of bits. This property 
could only be obtained using a symbol code if the source stream were 
somehow chopped into blocks. 

• Arithmetic codes combine a probabilistic model with an encoding 
algorithm that identifies each string with a sub-interval of [0, 1) of 
size equal to the probability of that string under the model. This 
code is almost optimal in the sense that the compressed length of a 
string x closely matches the Shannon information content of x given 
the probabilistic model. Arithmetic codes fit with the philosophy 
that good compression requires data modelling, in the form of an 
adaptive Bayesian model. 

• Lempel-Ziv codes are adaptive in the sense that they memorize 
strings that have already occurred. They are built on the philoso- 
phy that we don't know anything at all about what the probability 
distribution of the source will be, and we want a compression algo- 
rithm that will perform reasonably well whatever that distribution 
is. 

Both arithmetic codes and Lempel-Ziv codes will fail to decode correctly 
if any of the bits of the compressed file are altered. So if compressed files are 
to be stored or transmitted over noisy media, error-correcting codes will be 
essential. Reliable communication over unreliable channels is the topic of Part 
II. 

► 6.7 Exercises on stream codes 

■ck '^i Exercise 6.7. L J Describe an arithmetic coding algorithm to encode random bit 
strings of length N and weight K (i.e., K ones and N — K zeroes) where 
N and K are given. 

For the case TV = 5, K = 2 show in detail the intervals corresponding to 
all source substrings of lengths 1-5. 

> Exercise 6.8. p ' 128] How many bits are needed to specify a selection of K 
objects from N objects? (N and K are assumed to be known and the 
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selection of K objects is unordered.) How might such a selection be 
made at random without being wasteful of random bits? 

> Exercise 6.9. ^ A binary source X emits independent identically distributed 

symbols with probability distribution {/o,/i}, where f\ = 0.01. Find 
an optimal uniquely-decodeable symbol code for a string x = x\X2X% of 
three successive samples from this source. 

Estimate (to one decimal place) the factor by which the expected length 
of this optimal code is greater than the entropy of the three-bit string x. 

[#2(0.01) ~ 0.08, where H 2 (x) = :rlog 2 (l/x) + (1 - x)log 2 (l/(l - x)).] 

An arithmetic code is used to compress a string of 1000 samples from 
the source X. Estimate the mean and standard deviation of the length 
of the compressed file. 

> Exercise 6.10.^ Describe an arithmetic coding algorithm to generate random 

bit strings of length N with density / (i.e., each bit has probability / of 
being a one) where N is given. 

Exercise 6.11.'- 2 -' Use a modified Lempel-Ziv algorithm in which, as discussed 
on p. 120, the dictionary of prefixes is pruned by writing new prefixes 
into the space occupied by prefixes that will not be needed again. 
Such prefixes can be identified when both their children have been 
added to the dictionary of prefixes. (You may neglect the issue of 
termination of encoding.) Use this algorithm to encode the string 
0100001000100010101000001. Highlight the bits that follow a prefix 
on the second occasion that that prefix is used. (As discussed earlier, 
these bits could be omitted.) 

Exercise 6.12. P- 128 l Show that this modified Lempel-Ziv code is still not 
'complete', that is, there are binary strings that are not encodings of 
any string. 

> Exercise 6.13. ^' p - 128 l Give examples of simple sources that have low entropy 

but would not be compressed well by the Lempel-Ziv algorithm. 

► 6.8 Further exercises on data compression 

The following exercises may be skipped by the reader who is eager to learn 
about noisy channels. 

Exercise 6.14} 3 ' P- 130 l Consider a Gaussian distribution in N dimensions, 





(6.13) 




Estimate the mean 



and variance of the square of the radius, r 2 



(EX)- 



You may find helpful the integral 




(6.14) 



though you should be able to estimate the required quantities without it. 
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Assuming that N is large, show that nearly all the probability of a 
Gaussian is contained in a thin shell of radius yNcr. Find the thickness 
of the shell. 

Evaluate the probability density (6.13) at a point in that thin shell and 
at the origin x = and compare. Use the case N = 1000 as an example. 

Notice that nearly all the probability mass is located in a different part 
of the space from the region of highest probability density. 

Exercise 6.15.^ Explain what is meant by an optimal binary symbol code. 
Find an optimal binary symbol code for the ensemble: 

A = {a,b,c,d,e,f,g,h, i, j}, 

1 2 4 5 6 8 9 10 25 30 

Too' Too ' Too' Too' Too ' Too' Too ' Too' nw'Too 

and compute the expected length of the code. 

Exercise 6.16.^ A string y = x\X2 consists of two independent samples from 
an ensemble 



X : Ax = {^b,c};P x 



3 6 

To' To 



What is the entropy of y? Construct an optimal binary symbol code for 
the string y, and find its expected length. 

Exercise 6.17.^ Strings of N independent samples from an ensemble with 
V = {0.1, 0.9} are compressed using an arithmetic code that is matched 
to that ensemble. Estimate the mean and standard deviation of the 
compressed strings' lengths for the case N = 1000. [if 2 (0.1) — 0.47] 

Exercise 6.18.^ Source coding with variable-length symbols. 

In the chapters on source coding, we assumed that we were 
encoding into a binary alphabet {0, 1} in which both symbols 
should be used with equal frequency. In this question we ex- 
plore how the encoding alphabet should be used if the symbols 
take different times to transmit. 

A poverty-stricken student communicates for free with a friend using a 
telephone by selecting an integer n € {1,2,3...}, making the friend's 
phone ring n times, then hanging up in the middle of the nth ring. This 
process is repeated so that a string of symbols n\n2n% . . . is received. 
What is the optimal way to communicate? If large integers n are selected 
then the message takes longer to communicate. If only small integers n 
are used then the information content per symbol is small. We aim to 
maximize the rate of information transfer, per unit time. 

Assume that the time taken to transmit a number of rings n and to 
redial is l n seconds. Consider a probability distribution over n, {p n }. 
Defining the average duration per symbol to be 




probability density 
is maximized here 



y 

almost all 
probability mass is here 

Figure 6.8. Schematic 
representation of the typical set of 
an iV-dimcnsional Gaussian 
distribution. 



L(P) = ^PJn 

n 



(6.15) 
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and the entropy per symbol to be 

#(p) = Vp n log 2 — , (6.16) 

show that for the average information rate per second to be maximized, 
the symbols must be used with probabilities of the form 

Pn = ^2-^" (6.17) 

where Z = 2~P ln and (3 satisfies the implicit equation 

»-m- ,618 > 

that is, (3 is the rate of communication. Show that these two equations 
(6.17, 6.18) imply that (3 must be set such that 

logZ = 0. (6.19) 

Assuming that the channel has the property 

l n = n seconds, (6.20) 

find the optimal distribution p and show that the maximal information 
rate is 1 bit per second. 

How does this compare with the information rate per second achieved if 
p is set to (1/2, 1/2, 0, 0, 0, 0, . . .) — that is, only the symbols n = 1 and 
n = 2 are selected, and they have equal probability? 

Discuss the relationship between the results (6.17, 6.19) derived above, 
and the Kraft inequality from source coding theory. 

How might a random binary source be efficiently encoded into a se- 
quence of symbols ri\ri2n^ ... for transmission over the channel defined 
in equation (6.20)? 

> Exercise 6.19.^ How many bits does it take to shuffle a pack of cards? 

> Exercise 6.20. ^ In the card game Bridge, the four players receive 13 cards 

each from the deck of 52 and start each game by looking at their own 
hand and bidding. The legal bids are, in ascending order ljfc, 1<0>, IV, 
INT, 2*, 2<0>, ... 79,74,77VT, and successive bids must follow this 
order; a bid of, say, 2^ may only be followed by higher bids such as 2<(k 
or 'AX or INT. (Let us neglect the 'double' bid.) 

The players have several aims when bidding. One of the aims is for two 
partners to communicate to each other as much as possible about what 
cards are in their hands. 

Let us concentrate on this task. 

(a) After the cards have been dealt, how many bits are needed for North 
to convey to South what her hand is? 

(b) Assuming that E and W do not bid at all, what is the maximum 
total information that N and S can convey to each other while 
bidding? Assume that N starts the bidding, and that once either 
N or S stops bidding, the bidding stops. 
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> Exercise 6.21.^ My old 'arabic' microwave oven had 11 buttons for entering 
cooking times, and my new 'roman' microwave has just five. The but- 
tons of the roman microwave are labelled '10 minutes', '1 minute', '10 
seconds', '1 second', and 'Start'; I'll abbreviate these five strings to the 
symbols M, C, X, I, □. To enter one minute and twenty-three seconds 
(1:23), the arabic sequence is 



Arabic 



Roman 



and the roman sequence is 



123d, 



cxxiiid. 



(6.21) 



(6.22) 



Each of these keypads defines a code mapping the 3599 cooking times 
from 0:01 to 59:59 into a string of symbols. 

(a) Which times can be produced with two or three symbols? (For 
example, 0:20 can be produced by three symbols in either code: 
XXD and 20D.) 

(b) Are the two codes complete? Give a detailed answer. 

(c) For each code, name a cooking time that it can produce in four 
symbols that the other code cannot. 

(d) Discuss the implicit probability distributions over times to which 
each of these codes is best matched. 

(e) Concoct a plausible probability distribution over times that a real 
user might use, and evaluate roughly the expected number of sym- 
bols, and maximum number of symbols, that each code requires. 
Discuss the ways in which each code is inefficient or efficient. 

(f) Invent a more efficient cooking-time-encoding system for a mi- 
crowave oven. 

Exercise 6.22. ^' P- 132 l l s the standard binary representation for positive inte- 
gers (e.g. Cb(5) = 101) a uniquely decodeable code? 

Design a binary code for the positive integers, i.e., a mapping from 
n G {1,2,3,...} to c(n) € {0, l} + , that is uniquely decodeable. Try 
to design codes that are prefix codes and that satisfy the Kraft equality 

Motivations: any data file terminated by a special end of file character 
can be mapped onto an integer, so a prefix code for integers can be used 
as a self-delimiting encoding of files too. Large files correspond to large 
integers. Also, one of the building blocks of a 'universal' coding scheme - 
that is, a coding scheme that will work OK for a large variety of sources 
- is the ability to encode integers. Finally, in microwave ovens, cooking 
times are positive integers! 

Discuss criteria by which one might compare alternative codes for inte- 
gers (or, equivalently, alternative self-delimiting codes for files). 







Figure 6.9. Alternative keypads 
for microwave ovens. 



► 6.9 Solutions 

Solution to exercise 6.1 (p. 115). The worst-case situation is when the interval 
to be represented lies just inside a binary interval. In this case, we may choose 
either of two binary intervals as shown in figure 6.10. These binary intervals 
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Source string's interval Binary intervals 



P(x|W) 



Figure 6.10. Termination of 
arithmetic coding in the worst 
case, where there is a two bit 
overhead. Either of the two 
binary intervals marked on the 
right-hand side may be chosen. 
These binary intervals arc no 
smaller than P(x\H)/4. 



are no smaller than P(x|H)/4, so the binary encoding has a length no greater 
than log 2 l/P(x\H) + log 2 4, which is two bits more than the ideal message 
length. 

Solution to exercise 6.3 (p. 118). The standard method uses 32 random bits 
per generated symbol and so requires 32 000 bits to generate one thousand 
samples. 

Arithmetic coding uses on average about -ff2(0.01) = 0.081 bits per gener- 
ated symbol, and so requires about 83 bits to generate one thousand samples 
(assuming an overhead of roughly two bits associated with termination). 

Fluctuations in the number of Is would produce variations around this 
mean with standard deviation 21. 

Solution to exercise 6.5 (p. 120). The encoding is 010100110010110001100, 

which comes from the parsing 

0, 00, 000, 0000, 001, 00000, 000000 (6.23) 

which is encoded thus: 

(, 0), (1, 0), (10, 0), (11, 0), (010, 1), (100, 0), (110, 0). (6.24) 

Solution to exercise 6.6 (p. 120). The decoding is 

0100001000100010101000001. 

Solution to exercise 6.8 (p. 123). This problem is equivalent to exercise 6.7 
(p. 123). 

The selection of K objects from N objects requires [log 2 (^)l bits ~ 
NH 2 (K/N) bits. This selection could be made using arithmetic coding. The 
selection corresponds to a binary string of length N in which the 1 bits rep- 
resent which objects are selected. Initially the probability of a 1 is K/N and 
the probability of a is (N—K)/N. Thereafter, given that the emitted string 
thus far, of length n, contains k Is, the probability of a 1 is (K—k)/(N — n) 
and the probability of a is 1 — (K — k)/(N — n). 

Solution to exercise 6.12 (p. 124). This modified Lempel-Ziv code is still not 
'complete', because, for example, after five prefixes have been collected, the 
pointer could be any of the strings 000, 001, 010, 011, 100, but it cannot be 
101, 110 or 111. Thus there are some binary strings that cannot be produced 
as encodings. 



Solution to exercise 6.13 (p. 124). Sources with low entropy that are not well 
compressed by Lempel-Ziv include: 
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(a) Sources with some symbols that have long range correlations and inter- 
vening random junk. An ideal model should capture what's correlated 
and compress it. Lempel-Ziv can only compress the correlated features 
by memorizing all cases of the intervening junk. As a simple example, 
consider a telephone book in which every line contains an (old number, 
new number) pair: 

285-3820 :572-5892D 
258-8302 :593-2010D 

The number of characters per line is 18, drawn from the 13-character 
alphabet {0, 1, . . . , 9, — , :, □}. The characters ' : ' and '□' occur in a 
predictable sequence, so the true information content per line, assuming 
all the phone numbers are seven digits long, and assuming that they are 
random sequences, is about 14 bans. (A ban is the information content of 
a random integer between and 9.) A finite state language model could 
easily capture the regularities in these data. A Lempel-Ziv algorithm 
will take a long time before it compresses such a file down to 14 bans 
per line, however, because in order for it to 'learn' that the string : ddd 
is always followed by -, for any three digits ddd, it will have to see all 
those strings. So near-optimal compression will only be achieved after 
thousands of lines of the file have been read. 





Figure 6.11. A source with low entropy that is not well compressed by Lempel-Ziv. The bit sequence 
is read from left to right. Each line differs from the line above in / = 5% of its bits. The 
image width is 400 pixels. 

(b) Sources with long range correlations, for example two-dimensional im- 
ages that are represented by a sequence of pixels, row by row, so that 
vertically adjacent pixels are a distance w apart in the source stream, 
where w is the image width. Consider, for example, a fax transmission in 
which each line is very similar to the previous line (figure 6.11). The true 
entropy is only i?2(/) per pixel, where / is the probability that a pixel 
differs from its parent. Lempel-Ziv algorithms will only compress down 
to the entropy once all strings of length 2 W = 2 400 have occurred and 
their successors have been memorized. There are only about 2 300 par- 
ticles in the universe, so we can confidently say that Lempel-Ziv codes 
will never capture the redundancy of such an image. 

Another highly redundant texture is shown in figure 6.12. The image was 
made by dropping horizontal and vertical pins randomly on the plane. It 
contains both long-range vertical correlations and long-range horizontal 
correlations. There is no practical way that Lempel-Ziv, fed with a 
pixel-by-pixel scan of this image, could capture both these correlations. 

Biological computational systems can readily identify the redundancy in 
these images and in images that are much more complex; thus we might 
anticipate that the best data compression algorithms will result from the 
development of artificial intelligence methods. 
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Figure 6.12. A texture consisting of horizontal and vertical pins dropped at random on the plane. 



(c) Sources with intricate redundancy, such as files generated by computers. 
For example, a MpjX file followed by its encoding into a PostScript 
file. The information content of this pair of files is roughly equal to the 
information content of the WFf$t file alone. 

(d) A picture of the Mandelbrot set. The picture has an information content 
equal to the number of bits required to specify the range of the complex 
plane studied, the pixel sizes, and the colouring rule used. 

(e) A picture of a ground state of a frustrated antiferromagnetic Ising model 
(figure 6.13), which we will discuss in Chapter 31. Like figure 6.12, this 
binary image has interesting correlations in two directions. 




Figure 6.13. Frustrated triangular 
Ising model in one of its ground 
states. 



(f) Cellular automata - figure 6.14 shows the state history of 100 steps of 
a cellular automaton with 400 cells. The update rule, in which each 
cell's new state depends on the state of five preceding cells, was selected 
at random. The information content is equal to the information in the 
boundary (400 bits), and the propagation rule, which here can be de- 
scribed in 32 bits. An optimal compressor will thus give a compressed file 
length which is essentially constant, independent of the vertical height of 
the image. Lempel-Ziv would only give this zero-cost compression once 
the cellular automaton has entered a periodic limit cycle, which could 
easily take about 2 100 iterations. 

In contrast, the JBIG compression method, which models the probability 
of a pixel given its local context and uses arithmetic coding, would do a 
good job on these images. 

Solution to exercise 6.14 (p. 124). For a one-dimensional Gaussian, the vari- 
ance of x, £[x 2 }, is a 2 . So the mean value of r 2 in N dimensions, since the 
components of x are independent random variables, is 



£\r 2 ] = Na 2 . 



(6.25) 
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Figure 6.14. The 100-step time-history of a cellular automaton with 400 cells. 



The variance of r 2 , similarly, is N times the variance of x 2 , where i is a 
one-dimensional Gaussian variable. 

var(, 2 ) = fdx ^-L^exp (_JL^ - o\ (6.26) 

The integral is found to be 3cr 4 (equation (6.14)), so var(x 2 ) = 2a 4 . Thus the 
variance of r 2 is 2Na A . 

For large N, the central-limit theorem indicates that r 2 has a Gaussian 
distribution with mean Na 2 and standard deviation \/2N a 2 , so the probability 
density of r must similarly be concentrated about r ~ y/~N a. 

The thickness of this shell is given by turning the standard deviation 
of r 2 into a standard deviation on r: for small 6r/r, Slogr = Sr/r = 
(V2)51ogr 2 = (V2)5(r 2 )/r 2 , so setting 5(r 2 ) = \/2Na 2 , r has standard de- 
viation Sr = { l l2)r5{r 2 )/r 2 = a/y/2. 

The probability density of the Gaussian at a point x s h e ii where r = y/W a 

is 

Whereas the probability density at the origin is 

P(x=0) = odr^ (6 - 28) 



Thus P(x s heii)/-P( x = 0) = exp {—N/2) . The probability density at the typical 
radius is e~ N / 2 times smaller than the density at the origin. If TV = 1000, then 
the probability density at the origin is e 500 times greater. 
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This chapter is an aside, which may safely be skipped. 

Solution to exercise 6.22 (p. 127) 

To discuss the coding of integers we need some definitions. 

The standard binary representation of a positive integer n will be 
denoted by c b (n), e.g., c b (5) = 101, c b (45) = 101101. 

The standard binary length of a positive integer n, Z b (n), is the 
length of the string c b (n). For example, Z b (5) = 3, / b (45) = 6. 

The standard binary representation c b (n) is not a uniquely decodeable code 
for integers since there is no way of knowing when an integer has ended. For 
example, c b (5)c b (5) is identical to c b (45). It would be uniquely decodeable if 
we knew the standard binary length of each integer before it was received. 

Noticing that all positive integers have a standard binary representation 
that starts with a 1, we might define another representation: 

The headless binary representation of a positive integer n will be de- 
noted by ce(n), e.g., cb(5) = 01, cb(45) = 01101 and cb(1) = A (where 
A denotes the null string). 

This representation would be uniquely decodeable if we knew the length Z b (n) 
of the integer. 

So, how can we make a uniquely decodeable code for integers? Two strate- 
gies can be distinguished. 

1. Self-delimiting codes. We first communicate somehow the length of 
the integer, Z b (n), which is also a positive integer; then communicate the 
original integer n itself using cb(«). 

2. Codes with 'end of file' characters. We code the integer into blocks 
of length b bits, and reserve one of the 2 h symbols to have the special 
meaning 'end of file'. The coding of integers into blocks is arranged so 
that this reserved symbol is not needed for any other purpose. 

The simplest uniquely decodeable code for integers is the unary code, which 
can be viewed as a code with an end of file character. 




Codes for Integers 
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7 — Codes for Integers 

Unary code. An integer n is encoded by sending a string of n— 1 Os followed 
by a 1. 

n crj(n) 

1 1 

2 01 

3 001 

4 0001 

5 00001 

45 000000000000000000000000000000000000000000001 

The unary code has length l\j(n) = n. 

The unary code is the optimal code for integers if the probability distri- 
bution over n is p\j(n) = 2~ n . 

Self- delimiting codes 

We can use the unary code to encode the length of the binary encoding of n 
and make a self-delimiting code: 

Code C a . We send the unary code for l b {n), followed by the headless binary 
representation of n. 

c a (n) = cu[l h (n)]c B (n). (7.1) 
Table 7.1 shows the codes for some integers. The overlining indicates 
the division of each string into the parts crj[£b(^)] an d c B {n). We might 
equivalently view c a {n) as consisting of a string of (lb{n) — 1) zeroes 
followed by the standard binary representation of n, c\,(n). 

The codeword c a (n) has length l a (n) = 2l b (n) — 1. 

The implicit probability distribution over n for the code C a is separable 
into the product of a probability distribution over the length I, 

P(l) = 2~ l , (7.2) 

and a uniform distribution over integers having that length, 

P(n\l) = { 2 ~' +1 lh{n) = l (7 3) 

| otherwise. 

Now, for the above code, the header that communicates the length always 
occupies the same number of bits as the standard binary representation of 
the integer (give or take one). If we are expecting to encounter large integers 
(large files) then this representation seems suboptimal, since it leads to all files 
occupying a size that is double their original uncoded size. Instead of using 
the unary code to encode the length lh(n), we could use C a . 

Code Cp. We send the length l\,(n) using C a , followed by the headless binary 
representation of n. 

cp(n) = c a [l h (n)}c B (n). (7.4) 
Iterating this procedure, we can define a sequence of codes. 
Code C r 

c 7 (n) = cp[l b (n)]cB(n). (7.5) 
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n Cb(n) lb{n) c a (n) 



1 


1 


1 


T 


2 


10 


2 


010 


3 


11 


2 


011 


4 


100 


3 


ooToo 


5 


101 


3 


ooToi 


6 


110 


3 


ooTio 



45 101101 6 00000101101 



Table 7.1. C a . 



n cp(n) c 7 (n) 



1 


T 


T 


2 


oToo 


oTooo 


3 


oToi 


oTooi 


4 


oTToo 


oToToo 


5 


oTToi 


oToToi 


6 


oTTio 


oToTio 



45 0011001101 0111001101 



Table 7.2. C*g and C*. 



Code Cj. 



c 5 (n) = c 7 [/ b (n)]c B (n). 



(7.6) 
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Codes with end-of-file symbols 

We can also make byte-based representations. (Let's use the term byte flexibly 
here, to denote any fixed-length string of bits, not just a string of length 8 
bits.) If we encode the number in some base, for example decimal, then we 
can represent each digit in a byte. In order to represent a digit from to 9 in a 
byte we need four bits. Because 2 4 = 16, this leaves 6 extra four-bit symbols, 
{1010, 1011, 1100, 1101, 1110, 1111}, that correspond to no decimal digit. 
We can use these as end-of-file symbols to indicate the end of our positive 
integer. 

Clearly it is redundant to have more than one end-of-file symbol, so a more 
efficient code would encode the integer into base 15, and use just the sixteenth 
symbol, 1111, as the punctuation character. Generalizing this idea, we can 
make similar byte-based codes for integers in bases 3 and 7, and in any base 
of the form 2™ - 1. 

These codes are almost complete. (Recall that a code is 'complete' if it 
satisfies the Kraft inequality with equality.) The codes' remaining inefficiency 
is that they provide the ability to encode the integer zero and the empty string, 
neither of which was required. 

l> Exercise 7.lS 2 ' P 136 l Consider the implicit probability distribution over inte- 
gers corresponding to the code with an end-of-file character. 

(a) If the code has eight-bit blocks (i.e., the integer is coded in base 
255), what is the mean length in bits of the integer, under the 
implicit distribution? 

(b) If one wishes to encode binary files of expected size about one hun- 
dred kilobytes using a code with an end-of-file character, what is 
the optimal block size? 



n 



C3{n) 



c 7 (n 



1 01 11 

2 10 11 

3 0100 11 



001 111 

010 111 

011 111 



45 01 10000011 110011 111 



Table 7.3. Two codes with 
cnd-of-filc symbols, C3 and C7. 
Spaces have been included to 
show the byte boundaries. 



Encoding a tiny file 

To illustrate the codes we have discussed, we now use each code to encode a 
small file consisting of just 14 characters, 



Claude Shannon 



• If we map the ASCII characters onto seven-bit symbols (e.g., in decimal, 
C = 67, 1 = 108, etc.), this 14 character file corresponds to the integer 



n - 



1 67 987 786 364 950 89 1 085 602 469 870 (decimal) . 



The unary code for n consists of this many (less one) zeroes, followed by 
a one. If all the oceans were turned into ink, and if we wrote a hundred 
bits with every cubic millimeter, there might be enough ink to write 
crj(n). 

The standard binary representation of n is this length- 98 sequence of 
bits: 

c b (n) = 1000011110110011000011110101110010011001010100000 
1010011110100011000011101110110111011011111101110. 



i> Exercise 7.2.^ Write down or describe the following self-delimiting represen- 
tations of the above number n: c a (n), cp(n), c 7 (n), cg(ri), c^(n), £7(71), 
and Ci§{n). Which of these encodings is the shortest? [Answer: C15.] 
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Comparing the codes 

One could answer the question 'which of two codes is superior?' by a sentence 
of the form 'For n > k, code 1 is superior, for n < k, code 2 is superior' but I 
contend that such an answer misses the point: any complete code corresponds 
to a prior for which it is optimal; you should not say that any other code is 
superior to it. Other codes are optimal for other priors. These implicit priors 
should be thought about so as to achieve the best code for one's application. 

Notice that one cannot, for free, switch from one code to another, choosing 
whichever is shorter. If one were to do this, then it would be necessary to 
lengthen the message in some way that indicates which of the two codes is 
being used. If this is done by a single leading bit, it will be found that the 
resulting code is suboptimal because it fails the Kraft equality, as was discussed 
in exercise 5.33 (p. 104). 

Another way to compare codes for integers is to consider a sequence of 
probability distributions, such as monotonic probability distributions over n > 
1, and rank the codes as to how well they encode any of these distributions. 
A code is called a 'universal' code if for any distribution in a given class, it 
encodes into an average length that is within some factor of the ideal average 
length. 

Let me say this again. We are meeting an alternative world view - rather 
than figuring out a good prior over integers, as advocated above, many the- 
orists have studied the problem of creating codes that are reasonably good 
codes for any priors in a broad class. Here the class of priors convention- 
ally considered is the set of priors that (a) assign a monotonically decreasing 
probability over integers and (b) have finite entropy. 

Several of the codes we have discussed above are universal. Another code 
which elegantly transcends the sequence of self-delimiting codes is Elias's 'uni- 
versal code for integers' (Elias, 1975), which effectively chooses from all the 

codes C a ,Cp, It works by sending a sequence of messages each of which 

encodes the length of the next message, and indicates by a single bit whether 
or not that message is the final integer (in its standard binary representation). 
Because a length is a positive integer and all positive integers begin with '1', 
all the leading Is can be omitted. 



Write '0' 
Loop { 

If [lognj = halt 

Prepend Cb(n) to the written string 

n:= [log n\ 

} 



Algorithm 7.4. Elias's encoder for 
an integer n. 



The encoder of C w is shown in algorithm 7.4. The encoding is generated 
from right to left. Table 7.5 shows the resulting codewords. 



i> Exercise 7.3.'- ' Show that the Elias code is not actually the best code for a 
prior distribution that expects very large integers. (Do this by construct- 
ing another code and specifying how large n must be for your code to 
give a shorter length than Elias's.) 
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7 — Codes for Integers 



n 


c t ,,(n) 


n 




n 


C(,,(n) 


n 


l -W \ / 


1 





31 


10100111110 


9 


1110010 


256 


1110001000000000 


2 


100 


32 


101011000000 


10 


1110100 


365 


1110001011011010 


3 


110 


45 


101011011010 


11 


1110110 


511 


1110001111111110 


4 


101000 


63 


101011111110 


12 


1111000 


512 


11100110000000000 


5 


101010 


64 


1011010000000 


13 


1111010 


719 


11100110110011110 


6 


101100 


127 


1011011111110 


14 


1111100 


1023 


11100111111111110 


7 


101110 


128 


10111100000000 


15 


1111110 


1024 


111010100000000000 


8 


1110000 


255 


10111111111110 


16 


10100100000 


1025 


111010100000000010 



Solutions 

Solution to exercise 7.1 (p. 134). The use of the end-of-file symbol in a code 
that represents the integer in some base q corresponds to a belief that there is 
a probability of (l/(q+ 1)) that the current character is the last character of 
the number. Thus the prior to which this code is matched puts an exponential 
prior distribution over the length of the integer. 



Table 7.5. Elias's 'universal' code 
for integers. Examples from 1 to 
1025. 



(a) The expected number of characters is q+ 1 
of the integer is 256 x 8 ~ 2000 bits. 



256, so the expected length 



(b) We wish to find q such that qlogq ~ 800 000 bits. A value of q between 
2 15 and 2 16 satisfies this constraint, so 16-bit blocks are roughly the 
optimal size, assuming there is one end-of-file character. 
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Part II 

Noisy-Channel Coding 
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8 



Dependent Random Variables 



In the last three chapters on data compression we concentrated on random 
vectors x coming from an extremely simple probability distribution, namely 
the separable distribution in which each component x n is independent of the 
others. 

In this chapter, we consider joint ensembles in which the random variables 
are dependent. This material has two motivations. First, data from the real 
world have interesting correlations, so to do data compression well, we need 
to know how to work with models that include dependences. Second, a noisy 
channel with input x and output y defines a joint ensemble in which x and y are 
dependent - if they were independent, it would be impossible to communicate 
over the channel - so communication over noisy channels (the topic of chapters 
9-11) is described in terms of the entropy of joint ensembles. 

► 8.1 More about entropy 

This section gives definitions and exercises to do with entropy, carrying on 
from section 2.4. 

The joint entropy of X, Y is: 

H(X,Y)= £ P M log-i-. (8.1) 

xyeAxA Y { ,yJ 

Entropy is additive for independent random variables: 

H(X, Y) = H(X) + H(Y) iff P(x, y) = P(x)P(y). (8.2) 

The conditional entropy of X given y = is the entropy of the proba- 
bility distribution P(x \ y = b k ). 

H(X\y = b k ) = £ P(*\y = h)lo g 1 (8.3) 

xeAx v 1 ' 

The conditional entropy of X given Y is the average, over y, of the con- 
ditional entropy of X given y. 



H(X\Y) = £ P(y) 



yeA Y 



V P(x\y) log / 



xeA x 



= £ p (*>v)^p£M- (8 - 4) 

xyeAxAy 

This measures the average uncertainty that remains about x when y is 
known. 
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8.1: More about entropy 



The marginal entropy of X is another name for the entropy of X, H(X), 
used to contrast it with the conditional entropies listed above. 

Chain rule for information content. From the product rule for probabil- 
ities, equation (2.6), we obtain: 

log pkv) = log lk) +log WW) (8 - 5) 

so 

h(x,y) = h(x) + h(y\x). (8.6) 

In words, this says that the information content of x and y is the infor- 
mation content of x plus the information content of y given x. 

Chain rule for entropy. The joint entropy, conditional entropy and 
marginal entropy are related by: 

H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y). (8.7) 

In words, this says that the uncertainty of X and Y is the uncertainty 
of X plus the uncertainty of Y given X. 



The mutual information between X and Y is 



I(X;Y) = H(X)-H(X\Y), (8.8) 

and satisfies I(X;Y) = I(Y;X), and I(X;Y) > 0. It measures the 
average reduction in uncertainty about x that results from learning the 
value of y; or vice versa, the average amount of information that x 
conveys about y. 

The conditional mutual information between X and Y given z = c k 

is the mutual information between the random variables X and Y in 
the joint ensemble P(x, y\z = Cfc), 

I(X; Y\z = c k ) = H(X \z = c k )- H(X \Y,z = c k ). (8.9) 



The conditional mutual information between X and Y given Z is 

the average over z of the above conditional mutual information. 

I(X;Y\Z) = H(X\Z) - H(X\Y : Z). (8.10) 

No other 'three-term entropies' will be defined. For example, expres- 
sions such as I(X; Y; Z) and I(X \ Y; Z) are illegal. But you may put 
conjunctions of arbitrary numbers of variables in each of the three spots 
in the expression I(X; Y \Z) - for example, I(A, B;C,D\ E, F) is fine: 
it measures how much information on average c and d convey about a 
and b, assuming e and / are known. 



Figure 8.1 shows how the total entropy H(X,Y) of a joint ensemble can be 
broken down. This figure is important. 
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8 — Dependent Random Variables 



H(X,Y) 



H(X) 



H(Y) 



H(X\Y) 



I(X;Y) H(Y\X) 



Figure 8.1. The relationship 
between joint information, 
marginal entropy, conditional 
entropy and mutual entropy. 



► 8.2 Exercises 

t> Exercise 8.1.^ Consider three independent random variables u,v,w with en- 
tropies H u , H V ,H W . Let X = {U, V) and Y = (V, W). What is H(X, V)? 
What is H{X | Y)1 What is I{X; Y)l 

> Exercise 8.2. p - 142 l Referring to the definitions of conditional entropy (8.3- 
8.4), confirm (with an example) that it is possible for H(X \y = bk) to 
exceed H(X), but that the average, H(X\Y), is less than H(X). So 
data are helpful - they do not increase uncertainty, on average. 



Exercise 8.3.^' P- 143 l Prove the chain rule for entropy, equation 
[H(X,Y) = H(X)+H(Y\X)}. 



J.7). 



Exercise 8.4.^' p l43 l Prove that the mutual information I{X;Y) = H(X) 
H(X | Y) satisfies I{X; Y) = I(Y; X) and I{X; Y) > 0. 

[Hint: see exercise 2.26 (p. 37) and note that 



I(X;Y) = D KL (P(x,y)\\P(x)P(y)).} 



Ul) 



Exercise 8.5.^' The 'entropy distance' between two random variables can be 
defined to be the difference between their joint entropy and their mutual 
information: 

D H (X,Y) = H(X,Y)-I(X;Y). (8.12) 

Prove that the entropy distance satisfies the axioms for a distance - 
D H (X, Y) > 0, D H (X, X) = 0, D H (X, Y) = D H (Y, X), and D H (X, Z) < 
D H {X,Y) + D H (Y,Z). [Incidentally, we are unlikely to see D H (X,Y) 
again but it is a good function on which to practise inequality-proving.] 

Exercise 8.6.^ A joint ensemble XY has the following joint distribution. 

12 3 4 



1 
2 
3 
4 



P(x 


,y) 


1 


X 

2 


3 


4 




i 


Vs 


Vl6 


V32 


V32 


y 


2 


Vl6 


Vs 


V32 


V32 




3 


Vl6 


Vl6 


Vl6 


Vl6 




4 


V4 












What is the joint entropy H{X, Y)1 What are the marginal entropies 
H{X) and H{Y)1 For each value of y, what is the conditional entropy 
H(X | y)l What is the conditional entropy H(X \ Y)l What is the 
conditional entropy of Y given XI What is the mutual information 
between X and Y1 
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8.3: Further exercises 

Exercise 8.7.^' P- 143 l Consider the ensemble XYZ in which Ax 
Az = {0,1}, x and y are independent with Vx = {p, 1 
Py = {q, l — ?} and 

z = (x + y) mod 2. 

(a) If q = V2, what is What is I(Z; X)? 

(b) For general p and q, what is Vz^ What is I(Z; X)l Notice that 
this ensemble is related to the binary symmetric channel, with x = 
input, y = noise, and z = output. 




Figure 8.2. A misleading 
representation of entropies 
(contrast with figure 8.1). 



Three term entropies 

Exercise 8.8. t 5 ' p ' 143 ^ Many texts draw figure 8.1 in the form of a Venn diagram 
(figure 8.2). Discuss why this diagram is a misleading representation 
of entropies. Hint: consider the three- variable ensemble XYZ in which 
x g {0, 1} and y G {0, 1} are independent binary variables and z € {0, 1} 
is defined to be z = x + y mod 2. 

► 8.3 Further exercises 

The data-processing theorem 

The data processing theorem states that data processing can only destroy 
information. 

Exercise 8.9. P- 144 ] Prove this theorem by considering an ensemble WDR 
in which w is the state of the world, d is data gathered, and r is the 
processed data, so that these three variables form a Markov chain 

w->d->r, (8.14) 

that is, the probability P(w, d, r) can be written as 

P(w, d, r) = P(w)P(d | w)P(r \ d). (8.15) 

Show that the average information that R conveys about W, I(W; R), is 
less than or equal to the average information that D conveys about W, 
I(W;D). 

This theorem is as much a caution about our definition of 'information' as it 
is a caution about data processing! 



= A Y = 
— p} and 

(8.13) 
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142 8 — Dependent Random Variables 

Inference and information measures 
Exercise 8.10. L 1 The three cards. 



(a) One card is white on both faces; one is black on both faces; and one 
is white on one side and black on the other. The three cards are 
shuffled and their orientations randomized. One card is drawn and 
placed on the table. The upper face is black. What is the colour of 
its lower face? (Solve the inference problem.) 

(b) Does seeing the top face convey information about the colour of 
the bottom face? Discuss the information contents and entropies 
in this situation. Let the value of the upper face's colour be u and 
the value of the lower face's colour be I. Imagine that we draw 
a random card and learn both u and I. What is the entropy of 
u, H{U)1 What is the entropy of I, H(L)1 What is the mutual 
information between U and L, I(U;L)1 

Entropies of Markov processes 

> Exercise 8.11. ^1 In the guessing game, we imagined predicting the next letter 
in a document starting from the beginning and working towards the end. 
Consider the task of predicting the reversed text, that is, predicting the 
letter that precedes those already known. Most people find this a harder 
task. Assuming that we model the language using an N-graia model 
(which says the probability of the next character depends only on the 
N — 1 preceding characters), is there any difference between the average 
information contents of the reversed language and the forward language? 

► 8.4 Solutions 

Solution to exercise 8.2 (p. 140). See exercise 8.6 (p. 140) for an example where 
H(X | y) exceeds H{X) (set y = 3). 

We can prove the inequality H(X | Y) < H(X) by turning the expression 
into a relative entropy (using Bayes' theorem) and invoking Gibbs' inequality 
(exercise 2.26 (p. 37)): 



H(X\Y) ee £ P(y) 



yeA Y 



£ P(*|y)Iog^ 

■ r- A VI 



xeA x 



y) 



= E P(»,y) log ^ (8.16) 



xy 



= E p^) lo s + E p ( x ) E p (y i ') lo s ( 8 - 18 ) 

x ^ ' x y 

The last expression is a sum of relative entropies between the distributions 
P{y\x) and P(y). So 

H (X | Y) < H (X) + 0, (8.19) 

with equality only if P(y \ x) = P(y) for all x and y (that is, only if X and Y 
are independent). 
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8.4: Solutions 



Solution to exercise 8.3 (p. 140). The chain rule for entropy follows from the 
decomposition of a joint probability: 



H(X,Y) = ^P(x,y)log 



1 



P(x,y) 



xy 



log 



P(x) 



log 



P{y\x)\ 



(8.20) 
(8.21) 



x ) lQ g p ^ y | (8-22) 
(8.23) 

(8.24) 
(8.25) 

(8.26) 

(8.27) 

(8.28) 



P(x) 

= H{X) + H(Y\X). 
Solution to exercise 8.4 (p. 140). Symmetry of mutual information: 
I(X;Y) = H(X)-H(X\Y) 

= E^) lo g^~5 P(x,y)log ^^ 

xy 



^P(x,y)log 



P(x) 
P{x,y) 



P{x)P{y) 

This expression is symmetric in x and y so 

I(X; Y) = H(X) - H(X \ Y) = H(Y) - H(Y \ X). 



We can prove that mutual information is positive two ways. One is to continue 
from 

I(X;Y) = J2P(x,y)lo g ^l 



P(x)P(y) 



.29) 



which is a relative entropy and use Gibbs' inequality (proved on p. 44), which 
asserts that this relative entropy is > 0, with equality only if P(x,y) = 
P(x)P(y), that is, if X and Y are independent. 
The other is to use Jensen's inequality on 

- £ P(x, y) log > ~ log £ ^4p(x)P(y) = log 1 = 0. (8.30) 



P(x,y) 



x,y - x,y 

Solution to exercise 8.7 (p. 141). z = x + ymod2. 

(a) If q = V2, Vz = {Vl, l h) and I{Z; X) = H(Z) - H(Z \ X) = 1 - 1 = 0. 

(b) For general q andp, Vz = {P9+(l — p)(l — <l)i p(1—q)+q(1—p)}- The mutual 
information is I(Z;X) = H(Z)-H(Z\X) = H 2 {pq+(l-p){l-q))-H 2 {q). 



Three term entropies 

Solution to exercise 8.8 (p. 141). The depiction of entropies in terms of Venn 
diagrams is misleading for at least two reasons. 

First, one is used to thinking of Venn diagrams as depicting sets; but what 
are the 'sets' H(X) and H(Y) depicted in figure 8.2, and what are the objects 
that are members of those sets? I think this diagram encourages the novice 
student to make inappropriate analogies. For example, some students imagine 
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144 8 — Dependent Random Variables 




that the random outcome (x,y) might correspond to a point in the diagram, 
and thus confuse entropies with probabilities. 

Secondly, the depiction in terms of Venn diagrams encourages one to be- 
lieve that all the areas correspond to positive quantities. In the special case of 
two random variables it is indeed true that H(X \ Y), I(X; Y) and H(Y \ X) 
are positive quantities. But as soon as we progress to three- variable ensembles, 
we obtain a diagram with positive-looking areas that may actually correspond 
to negative quantities. Figure 8.3 correctly shows relationships such as 

H(X) + H(Z | X) + H(Y \X,Z) = H(X, Y, Z). (8.31) 

But it gives the misleading impression that the conditional mutual information 
I(X; Y | Z) is less than the mutual information I(X;Y). In fact the area 
labelled A can correspond to a negative quantity. Consider the joint ensemble 
(X, Y, Z) in which x G {0, 1} and y G {0, 1} are independent binary variables 
and z G {0,1} is defined to be z = x + ymod2. Then clearly H{X) = 
H(Y) = 1 bit. Also H(Z) = 1 bit. And H{Y | X) = H(Y) = 1 since the two 
variables are independent. So the mutual information between X and Y is 
zero. I{X; Y) = 0. However, if z is observed, X and Y become dependent — 
knowing x, given z, tells you what y is: y = z — a; mod 2. So I{X; Y \ Z) = 1 
bit. Thus the area labelled A must correspond to —1 bits for the figure to give 
the correct answers. 

The above example is not at all a capricious or exceptional illustration. The 
binary symmetric channel with input X, noise Y, and output Z is a situation 
in which I(X; Y) = (input and noise are independent) but I{X; Y \ Z) > 
(once you see the output, the unknown input and the unknown noise are 
intimately related!). 

The Venn diagram representation is therefore valid only if one is aware 
that positive areas may represent negative quantities. With this proviso kept 
in mind, the interpretation of entropies in terms of sets can be helpful (Yeung, 
1991). 

Solution to exercise 8.9 (p. 141). For any joint ensemble XYZ, the following 
chain rule for mutual information holds. 

I(X;Y,Z) = I(X;Y) + I(X;Z\Y). (8.32) 

Now, in the case w — > d — > r, w and r are independent given d, so 
I{W; R | D) = 0. Using the chain rule twice, we have: 

I(W; D, R) = I(W; D) (8.33) 

and 

I{W;D,R) = I(W;R) + I{W;D\R), (8.34) 



so 



I(W;R) - I(W;D) < 0. 



(8.35) 
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About Chapter 9 



Before reading Chapter 9, you should have read Chapter 1 and worked on 
exercise 2.26 (p. 37), and exercises 8.2-8.7 (pp. 140-141). 
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9 



Communication over a Noisy Channel 



► 9.1 The big picture 



Source 



Source 
coding 



Compressor 



Decompressor 



Channel 
coding 



Encoder 



Decoder 



Noisy 
channel 



In Chapters 4-6, we discussed source coding with block codes, symbol codes 
and stream codes. We implicitly assumed that the channel from the compres- 
sor to the decompressor was noise-free. Real channels are noisy. We will now 
spend two chapters on the subject of noisy-channel coding - the fundamen- 
tal possibilities and limitations of error-free communication through a noisy 
channel. The aim of channel coding is to make the noisy channel behave like 
a noiseless channel. We will assume that the data to be transmitted has been 
through a good compressor, so the bit stream has no obvious redundancy. The 
channel code, which makes the transmission, will put back redundancy of a 
special sort, designed to make the noisy received signal decodeable. 

Suppose we transmit 1000 bits per second with Po = Pi = V2 over a 
noisy channel that flips bits with probability / = 0.1. What is the rate of 
transmission of information? We might guess that the rate is 900 bits per 
second by subtracting the expected number of errors per second. But this is 
not correct, because the recipient does not know where the errors occurred. 
Consider the case where the noise is so great that the received symbols are 
independent of the transmitted symbols. This corresponds to a noise level of 
/ = 0.5, since half of the received symbols are correct due to chance alone. 
But when / = 0.5, no information is transmitted at all. 

Given what we have learnt about entropy, it seems reasonable that a mea- 
sure of the information transmitted is given by the mutual information between 
the source and the received signal, that is, the entropy of the source minus the 
conditional entropy of the source given the received signal. 

We will now review the definition of conditional entropy and mutual in- 
formation. Then we will examine whether it is possible to use such a noisy 
channel to communicate reliably. We will show that for any channel Q there 
is a non-zero rate, the capacity C(Q), up to which information can be sent 
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9.2: Review of probability and information 
with arbitrarily small probability of error. 

► 9.2 Review of probability and information 

As an example, we take the joint distribution XY from exercise 8.6 (p. 140). 
The marginal distributions P(x) and P{y) are shown in the margins. 



P(x,y) 


1 


2 


X 

3 


4 


P(y) 


1 


Vs 


Vl6 


1/32 


1/32 


1/4 


y 2 


Vl6 


Vs 


1/32 


1/32 


1/4 


3 


Vl6 


Vl6 


Vl6 


Vl6 


1/4 


4 


V4 











1/4 


P{x) 


V2 


1/4 


1/8 


1/8 





The joint entropy is H(X, Y) = 27/8 bits. The marginal entropies are H(X) = 
7/4 bits and H{Y) = 2 bits. 

We can compute the conditional distribution of x for each value of y, and 
the entropy of each of those conditional distributions: 



P(x 


\y) 


1 


X 

2 


3 


4 


H(X 1 2/)/bits 




l 


1/2 


1/4 


1/8 


1/8 


V4 


y 


2 


1/4 


1/2 


1/8 


1/8 


V4 




3 


1/4 


1/4 


1/4 


1/4 


2 




4 


1 















H(X | Y) = H/8 



Note that whereas H(X | j/ = 4) = is less than H(X), H(X \ y = 3) is greater 
than H{X). So in some cases, learning y can increase our uncertainty about 
x. Note also that although P(x \ y = 2) is a different distribution from P(x), 
the conditional entropy H(X \y = 2) is equal to H(X). So learning that y 
is 2 changes our knowledge about x but does not reduce the uncertainty of 
x, as measured by the entropy. On average though, learning y does convey 
information about x, since H(X \ Y) < H(X). 

One may also evaluate H(Y\X) = 13/8 bits. The mutual information is 
I(X; Y) = H(X) - H(X | Y) = 3/8 bits. 

► 9.3 Noisy channels 

A discrete memoryless channel Q is characterized by an input alphabet 
Ax, an output alphabet Ay, and a set of conditional probability distri- 
butions P(y | x), one for each x € Ax- 

These transition probabilities may be written in a matrix 

Q j \ i = P(y = b j \x = a i ). (9.1) 

I usually orient this matrix with the output variable j indexing the rows 
and the input variable i indexing the columns, so that each column of Q is 
a probability vector. With this convention, we can obtain the probability 
of the output, p Y , from a probability distribution over the input, px, by 
right-multiplication: 

Py = Qpx- (9.2) 
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Some useful model channels are: 

Binary symmetric channel. ^^ = {0,1}. .4y = {0, 1}. 



x X y 
i^i 



P( y = 0ja; = 0) 
P{y=l\x = 0) 



1-f; P(y = 0\x=l) 



/; 



P(y=l\x=l) = 



Binary erasure channel. Ax = {0,1}. *4y = {0,?,l}. 




P(y=0\ X = 0) 

P{y = ?\x = 0) 
P{y=l\x = 0) 



1-/; P(y = 0\x=l) = 



/; 
0; 



P{y = ?\x=l) 
P(y=l\x=l) 



Noisy typewriter. Ax = Ay = the 27 letters {A, B, . . . , Z, -}. The letters 
are arranged in a circle, and when the typist attempts to type B, what 
comes out is either A, B or C, with probability Vz each; when the input is 
C, the output is B, C or D; and so forth, with the final letter '-' adjacent 
to the first letter A. 




P(y = T\x = G) 
P{y=Q\x = Q) 
P{y = l\\x = G) 



1/3 
1/3 
1/3 



/; 

!-/• 



0; 
/; 

!-/• 



ABCDEFGHIJKLMNOPQRSTUVWXYZ- 



Z channel. A x = {0,l}- A Y = {0,1}. 



0—^0 



P(y=0\ X = 0) 

P{y=l\x = 0) 



1; P(y = 0\x=l) = /; 
0; P(y = l|!c=l) = 1-f. 



► 9.4 Inferring the input given the output 

If we assume that the input a; to a channel comes from an ensemble X, then 
we obtain a joint ensemble XY in which the random variables x and y have 
the joint distribution: 

P(x,y) =P(y\x)P(x). (9.3) 

Now if we receive a particular symbol y, what was the input symbol xl We 
typically won't know for certain. We can write down the posterior distribution 
of the input using Bayes' theorem: 

P(y\x)P(x) P(y\x)P(x) 



P(x | y) 



P{y) 



J2 x ,P(y\x')P(x'Y 



(9.4) 



Example 9.1. Consider a binary symmetric channel with probability of error 
/ = 0.15. Let the input ensemble be Vx ■ {po = 0.9, pi = 0.1}. Assume 
we observe y = 1 . 



P(x=l\y=l) = 



P(y=l\ X =l)P(x=l) 

£ x , P(y\x')P(x') 

0.85 x 0.1 

0.85 x 0.1 + 0.15 x0.9 
0.085 

= 0.39. 

0.22 



(9.5) 
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9.5: Information conveyed by a channel 

Thus l x = V is still less probable than l x= 0', although it is not as im- 
probable as it was before. 

Exercise 9.2. [1 ' p ' 157] Now assume we observe y = 0. Compute the probability 
"S~ of x = 1 given // I). 



Example 9.3. Consider a Z channel with probability of error / = 0.15. Let the 
input ensemble be Vx ■ {po = 0.9, pi = 0.1}. Assume we observe y = l. 

0.85x0.1 

P(x = l\y = l) 



0.85 x 0.1 + x0.9 

0.085 . . 

= = 1.0. (9.6) 

0.085 1 ' 

So given the output y = 1 we become certain of the input. 

Exercise 9a} 1 ' P- 157 l Alternatively, assume we observe y = 0. Compute 
P(x = l\y = 0). 



► 9.5 Information conveyed by a channel 

We now consider how much information can be communicated through a chan- 
nel. In operational terms, we are interested in finding ways of using the chan- 
nel such that all the bits that are communicated are recovered with negligible 
probability of error. In mathematical terms, assuming a particular input en- 
semble X, we can measure how much information the output conveys about 
the input by the mutual information: 

I(X; Y) = H(X) - H(X \ Y) = H(Y) - H(Y\X). (9.7) 

Our aim is to establish the connection between these two ideas. Let us evaluate 
I(X; Y) for some of the channels above. 



Hint for computing mutual information 

We will tend to think of I{X;Y) as H{X) - H(X\Y), i.e., how much the 
uncertainty of the input X is reduced when we look at the output Y. But for 
computational purposes it is often handy to evaluate H(Y) — H(Y\X) instead. 



H(X,Y) 



H(X) 



H(Y) 



H(X\Y) 



I(X;Y) H(Y\X) 



Figure 9.1. The relationship 
between joint information, 
marginal entropy, conditional 
entropy and mutual entropy. 
This figure is important, so I'i 
showing it twice. 



Example 9.5. Consider the binary symmetric channel again, with / = 0.15 and 
"Px '■ {Po = 0.9, Pi = 0.1}. We already evaluated the marginal probabil- 
ities P(y) implicitly above: P(y = 0) = 0.78; P(y = l) = 0.22. The 
mutual information is: 



I(X;Y) = H(Y)-H(Y\X). 
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What is H(Y\X)1 It is defined to be the weighted sum over x of H(Y \ x); 
but H(Y | x) is the same for each value of x: H(Y \x = 0) is #2(0.15), 
and H(Y \x=l) is # 2 (0.15). So 

I(X;Y) = H(Y)-H(Y\X) 

= # 2 (0.22) -# 2 (0.15) 

= 0.76 - 0.61 = 0.15 bits. (9.8) 

This may be contrasted with the entropy of the source H{X) = 
ff 2 (0.1) = 0.47 bits. 

Note: here we have used the binary entropy function #2(p) = H(p, 1 — 

p) = p\og | + (1 — p) log 7jz^y- Throughout this book, log means 



lot 



2- 



Example 9.6. And now the Z channel, with Vx as above. P(y = 1) =0.085. 

I(X-Y) = H(Y)-H(Y\X) 

= # 2 (0.085) - [0.9# 2 (0) + 0.1# 2 (0.15)] 

= 0.42 - (0.1 x 0.61) = 0.36 bits. (9.9) 

The entropy of the source, as above, is H(X) = 0.47 bits. Notice that 
the mutual information I(X; Y) for the Z channel is bigger than the 
mutual information for the binary symmetric channel with the same /. 
The Z channel is a more reliable channel. 



Exercise 9.7 M' P 15 ^] Compute the mutual information between X and Y for 
the binary symmetric channel with / = 0.15 when the input distribution 
isPx = {po = 0.5, Pl = 0.5}. 

Exercise 9.8.^' p - 157 l Compute the mutual information between X and Y for 
the Z channel with / = 0.15 when the input distribution is Vx '■ 
{p = 0.5,pi = 0.5}. 



Maximizing the mutual information 

We have observed in the above examples that the mutual information between 
the input and the output depends on the chosen input ensemble. 

Let us assume that we wish to maximize the mutual information conveyed 
by the channel by choosing the best possible input ensemble. We define the 
capacity of the channel to be its maximum mutual information. 

The capacity of a channel Q is: 

C{Q) = max I (X;Y). (9.10) 

Vx 

The distribution Vx that achieves the maximum is called the optimal 
input distribution, denoted by V* x . [There may be multiple optimal 
input distributions achieving the same value of I{X; Y).] 

In Chapter 10 we will show that the capacity does indeed measure the maxi- 
mum amount of error-free information that can be transmitted over the chan- 
nel per unit time. 

Example 9.9. Consider the binary symmetric channel with / = 0.15. Above, 
we considered V x = {po = 0.9, pi = 0.1}, and found I{X; Y) = 0.15 bits. 
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How much better can we do? By symmetry, the optimal input distribu- 
tion is {0.5, 0.5} and the capacity is 

C(Qbsc) = # 2 (0.5) - H 2 (0.15) = 1.0-0.61 = 0.39bits. (9.11) 

We'll justify the symmetry argument later. If there's any doubt about 
the symmetry argument, we can always resort to explicit maximization 
of the mutual information I(X; Y), 

I(X;Y) = H 2 ((l-f) Pi + (l- Pl )f)-H 2 (f) (figure 9.2). (9.12) 

Example 9.10. The noisy typewriter. The optimal input distribution is a uni- 
form distribution over x, and gives C = log 2 9 bits. 

Example 9.11. Consider the Z channel with / = 0.15. Identifying the optimal 
input distribution is not so straightforward. We evaluate I{X; Y) explic- 
itly for Vx = {potPi}- First, we need to compute P{y). The probability 
of y = 1 is easiest to write down: 



P(y = l) = pi(l-/). 

Then the mutual information is: 

I(X-Y) = H(Y)-H(Y\X) 

= H 2 ( Pl (l-f))-( Po H 2 (0) 
= H 2 ( Pl (l-f))- Pl H 2 (f). 



(9.13) 



PiH 2 (f)) 



(9.14) 



This is a non-trivial function of P \, shown in figure 9.3. It is maximized 
for / = 0.15 by P \ = 0.445. We find C{Q Z ) = 0.685. Notice the optimal 
input distribution is not {0.5,0.5}. We can communicate slightly more 
information by using input symbol more frequently than 1. 

Exercise 9.12.^' P- 158 l What is the capacity of the binary symmetric channel 
for general /? 

Exercise 9.13.^' P- 158 l Show that the capacity of the binary erasure channel 
with / = 0.15 is Cbec = 0.85. What is its capacity for general /? 
Comment. 




Figure 9.2. The mutual 
information I(X; Y) for a binary 
symmetric channel with / = 0.15 
as a function of the input 
distribution. 



i(x-X) 




Figure 9.3. The mutual 
information I(X;Y) for a Z 
channel with / = 0.15 as a 
function of the input distribution. 



► 9.6 The noisy-channel coding theorem 

It seems plausible that the 'capacity' we have defined may be a measure of 
information conveyed by a channel; what is not obvious, and what we will 
prove in the next chapter, is that the capacity indeed measures the rate at 
which blocks of data can be communicated over the channel with arbitrarily 
small P robability of error. 

We make the following definitions. 

An (N, K) block code for a channel Q is a list of S = 2 K codewords 

{xW,x( 2 ),...,x^)}, xWe^, 

each of length N. Using this code we can encode a signal s € 
{1, 2, 3, . . . , 2 K } as x.( s \ [The number of codewords S is an integer, 
but the number of bits specified by choosing a codeword, K = log 2 S, is 
not necessarily an integer.] 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 



152 



9 — Communication over a Noisy Channel 



The rate of the code is R = K/N bits per channel use. 

[We will use this definition of the rate for any channel, not only chan- 
nels with binary inputs; note however that it is sometimes conventional 
to define the rate of a code for a channel with q input symbols to be 
K/(Nlogq).} 

A decoder for an (N, K) block code is a mapping from the set of length- N 
strings of channel outputs, Ay , to a codeword label s £ {0, 1,2,..., 2 K }. 

The extra symbol s = can be used to indicate a 'failure'. 

The probability of block error of a code and decoder, for a given channel, 
and for a given probability distribution over the encoded signal P(s ln ), 
is: 

PB = ^P(sin)P(.W^s in |s in ). (9.15) 



The maximal probability of block error is 

Pbm = max P(s out 7^ s in | s in ). 



(9.16) 



The optimal decoder for a channel code is the one that minimizes the prob- 
ability of block error. It decodes an output y as the input s that has 
maximum posterior probability P(s |y). 



P(s\y) 



P(y\s)P(s) 

£ s , P(y\s>)P(s>) 
= argmax P(s | y). 



(9.17) 
(9.18) 



Soptimal 

A uniform prior distribution on s is usually assumed, in which case the 
optimal decoder is also the maximum likelihood decoder, i.e., the decoder 
that maps an output y to the input s that has maximum likelihood 

P(y\s). 

The probability of bit error p^ is defined assuming that the codeword 
number s is represented by a binary vector s of length K bits; it is the 
average probability that a bit of s out is not equal to the corresponding 
bit of Sin (averaging over all K bits). 

Shannon's noisy-channel coding theorem (part one). Associated with 
each discrete memoryless channel, there is a non-negative number C 
(called the channel capacity) with the following property. For any e > 
and R < C, for large enough N, there exists a block code of length N and 
rate > R and a decoding algorithm, such that the maximal probability 
of block error is < e. 



Pbm 



achievable 



C 



R 



Figure 9.4. Portion of the -R,Pbm 
plane asserted to be achievable by 
the first part of Shannon's noisy 
channel coding theorem. 



Confirmation of the theorem for the noisy typewriter channel 

In the case of the noisy typewriter, we can easily confirm the theorem, because 
we can create a completely error-free communication strategy using a block 
code of length N = 1: we use only the letters B, E, H, . . . , Z, i.e., every third 
letter. These letters form a non-confusable subset of the input alphabet (see 
figure 9.5). Any output can be uniquely decoded. The number of inputs 
in the non-confusable subset is 9, so the error-free information rate of this 
system is log 2 9 bits, which is equal to the capacity C, which we evaluated in 
example 9.10 (p. 151). 
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Figure 9.5. A non-confusable 
subset of inputs for the noisy 
typewriter. 



OOOOOOOOiHi-Hi— li— li— li— li— I. 
OOOOi— li— li— li— IOOOOi— li— li— |. 
OOi— li— lOOi— li— lOOi— li— IOOrH> 
Oi— IOi— IOi— IOi— IOi— IOi— IOi— IO' 



N ■- 



N - 




Figure 9.6. Extended channels 
obtained from a binary symmetric 
channel with transition 
probability 0.15. 



How does this translate into the terms of the theorem? The following table 
explains. 



The theorem 



How it applies to the noisy typewriter 



Associated with each discrete 
memoryless channel, there is a 
non-negative number C . 
For any e > and R < C, for large 
enough N, 

there exists a block code of length N and 
rate > R 

and a decoding algorithm, 

such that the maximal probability of 
block error is < e. 



The capacity C is log 2 9. 



No matter what e and R are, we set the blocklength TV to 1. 

The block code is {B, E, . . . , Z}. The value of K is given by 
2 K = 9, so K = log 2 9, and this code has rate log 2 9, which is 
greater than the requested value of R. 

The decoding algorithm maps the received letter to the nearest 
letter in the code; 

the maximal probability of block error is zero, which is less 
than the given e. 



► 9.7 Intuitive preview of proof 

Extended channels 

To prove the theorem for any given channel, we consider the extended channel 
corresponding to N uses of the channel. The extended channel has 1^4x1^ 
possible inputs x and |*4y| w possible outputs. Extended channels obtained 
from a binary symmetric channel and from a Z channel are shown in figures 
9.6 and 9.7, with N = 2 and N = 4. 
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0000 
1000 
0100 
1100 
0010 
1010 
0110 
1110 
0001 
1001 
0101 
1101 
0011 
1011 
0111 

1111 



OOOOOOOOi — I i — I i — I i — I i — I i — I i — I ' 
OOOOi — i — i — i — IOOOOi — . — i i — (i 



9 — Communication over a Noisy Channel 

Figure 9.7. Extended channels 
obtained from a Z channel with 
transition probability 0.15. Each 
column corresponds to an input, 
and each row is a different output. 



N ■- 



N - 



N - 



Typical y 




Typical y for a given typical x 



Typical y 



(b) 



Figure 9.8. (a) Some typical 
outputs in Ay corresponding to 
typical inputs x. (b) A subset of 
the typical sets shown in (a) that 
do not overlap each other. This 
picture can be compared with the 
solution to the noisy typewriter in 
figure 9.5. 



Exercise 9.14.'- ' p ' > Find the transition probability matrices Q for the ex- 
tended channel, with N = 2, derived from the binary erasure channel 
having erasure probability 0.15. 

By selecting two columns of this transition probability matrix, we can 
define a rate-V2 code for this channel with blocklength N = 2. What is 
the best choice of two columns? What is the decoding algorithm? 

To prove the noisy-channel coding theorem, we make use of large block- 
lengths N. The intuitive idea is that, if N is large, an extended channel looks 
a lot like the noisy typewriter. Any particular input x is very likely to produce 
an output in a small subspace of the output alphabet - the typical output set, 
given that input. So we can find a non-confusable subset of the inputs that 
produce essentially disjoint output sequences. For a given N, let us consider 
a way of generating such a non-confusable subset of the inputs, and count up 
how many distinct inputs it contains. 

Imagine making an input sequence x for the extended channel by drawing 
it from an ensemble X , where X is an arbitrary ensemble over the input 
alphabet. Recall the source coding theorem of Chapter 4, and consider the 
number of probable output sequences y. The total number of typical output 
sequences y is 2 NHtyY \ all having similar probability. For any particular typical 
input sequence x, there are about 2 NH ( Y \ X "> probable sequences. Some of these 
subsets of Ay are depicted by circles in figure 9.8a. 

We now imagine restricting ourselves to a subset of the typical inputs 
x such that the corresponding typical output sets do not overlap, as shown 
in figure 9.8b. We can then bound the number of non-confusable inputs by 
dividing the size of the typical y set, 2 NH ( Y \ by the size of each typical- y- 
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given-typical- x set, 2 NH ( Y \ X \ So the number of non-confusable inputs, if they 
are selected from the set of typical inputs x ~ X N , is < 2 NH ^ Y ^ NH ^ Y ^ X "> = 

2NI(X;Y) 

The maximum value of this bound is achieved if X is the ensemble that 
maximizes I(X;Y), in which case the number of non-confusable inputs is 
< 2 NC . Thus asymptotically up to C bits per cycle, and no more, can be 
communicated with vanishing error probability. □ 

This sketch has not rigorously proved that reliable communication really 
is possible - that's our task for the next chapter. 



► 9.8 Further exercises 



Exercise 9.15.^' P- 159 l Refer back to the computation of the capacity of the Z 
channel with / = 0.15. 

(a) Why is p\ less than 0.5? One could argue that it is good to favour 
the input, since it is transmitted without error - and also argue 
that it is good to favour the 1 input, since it often gives rise to the 
highly prized 1 output, which allows certain identification of the 
input! Try to make a convincing argument. 

(b) In the case of general /, show that the optimal input distribution 
is 

V(l-/) 



Pl ~ 1 + 2 (#2(/)/(l-/)) ' 

(c) What happens to p* if the noise level / is very close to 1? 



(9.19) 



Exercise 9. 16.^' p ' 159 l Sketch graphs of the capacity of the Z channel, the 
binary symmetric channel and the binary erasure channel as a function 
of/. 

t> Exercise 9.17.^ What is the capacity of the five-input, ten-output channel 
whose transition probability matrix is 



0.25 
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0.25 


0.25 
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0.25 














0.25 


0.25 











0.25 


0.25 














0.25 


0.25 











0.25 


0.25 



(9.20) 



Exercise 9.18.^' p l59 l Consider a Gaussian channel with binary input x € 
{— 1,-1-1} and real output alphabet Ay, with transition probability den- 
sity 

1 (y-xa) 2 
Q(y\x,a,<j) = -j==e ~2P~ , (9.21) 

where a is the signal amplitude. 

(a) Compute the posterior probability of x given y, assuming that the 
two inputs are equiprobable. Put your answer in the form 

P(x = l\y,a,a) = - ■ \ , . . (9.22) 
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Sketch the value of P(x = 1 1 y, a, a) as a function of y. 

(b) Assume that a single bit is to be transmitted. What is the optimal 
decoder, and what is its probability of error? Express your answer 
in terms of the signal-to- noise ratio a 2 /a 2 and the error function 
(the cumulative probability function of the Gaussian distribution), 



*{z) 



I" 



2ir 



2 dz. 



(9.23) 



[Note that this definition of the error function Q(z) may not corre- 
spond to other people's.] 

Pattern recognition as a noisy channel 

We may think of many pattern recognition problems in terms of communi- 
cation channels. Consider the case of recognizing handwritten digits (such 
as postcodes on envelopes). The author of the digit wishes to communicate 
a message from the set Ax = {0, 1,2,3,... ,9}; this selected message is the 
input to the channel. What comes out of the channel is a pattern of ink on 
paper. If the ink pattern is represented using 256 binary pixels, the channel 
Q has as its output a random variable y € Ay = {0, l} 256 . An example of an 
element from this alphabet is shown in the margin. 

Exercise 9.19.'- 2 -' Estimate how many patterns in Ay are recognizable as the 
character '2'. [The aim of this problem is to try to demonstrate the 
existence of as many patterns as possible that are recognizable as 2s.] 

Discuss how one might model the channel P(y\x = 2). Estimate the 
entropy of the probability distribution P(y | x = 2). 

One strategy for doing pattern recognition is to create a model for 
P(y | x) for each value of the input x = {0, 1, 2, 3, . . . , 9}, then use Bayes' 
theorem to infer x given y. 



P(x | y) = 



P{y\x)P{x) 

2~2 x >P(y\x')P(x'Y 



(9.24) 



This strategy is known as full probabilistic modelling or generative 
modelling. This is essentially how current speech recognition systems 
work. In addition to the channel model, P(y \ x), one uses a prior proba- 
bility distribution P{x), which in the case of both character recognition 
and speech recognition is a language model that specifies the probability 
of the next character/ word given the context and the known grammar 
and statistics of the language. 

Random coding 

Exercise 9.20.^' P- 16u l Given twenty-four people in a room, what is the prob- 
ability that there are at least two people present who have the same 
birthday (i.e., day and month of birth)? What is the expected number 
of pairs of people with the same birthday? Which of these two questions 
is easiest to solve? Which answer gives most insight? You may find it 
helpful to solve these problems and those that follow using notation such 
as A = number of days in year = 365 and S = number of people = 24. 

l> Exercise 9.21. ^ The birthday problem may be related to a coding scheme. 

Assume we wish to convey a message to an outsider identifying one of 






Figure 9.9. Some more 2s. 
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the twenty-four people. We could simply communicate a number s from 
As = {1,2,..., 24}, having agreed a mapping of people onto numbers; 
alternatively, we could convey a number from Ax = {1, 2, . . . , 365}, 
identifying the day of the year that is the selected person's birthday 
(with apologies to leapyearians) . [The receiver is assumed to know all 
the people's birthdays.] What, roughly, is the probability of error of this 
communication scheme, assuming it is used for a single transmission? 
What is the capacity of the communication channel, and what is the 
rate of communication attempted by this scheme? 

> Exercise 9.22.^ Now imagine that there are K rooms in a building, each 
containing q people. (You might think of K = 2 and q = 24 as an 
example.) The aim is to communicate a selection of one person from each 
room by transmitting an ordered list of K days (from Ax)- Compare 
the probability of error of the following two schemes. 

(a) As before, where each room transmits the birthday of the selected 
person. 

(b) To each if-tuple of people, one drawn from each room, an ordered 
K-twple of randomly selected days from Ax is assigned (this K- 
tuple has nothing to do with their birthdays). This enormous list 
of S = q K strings is known to the receiver. When the building has 
selected a particular person from each room, the ordered string of 
days corresponding to that X-tuple of people is transmitted. 

What is the probability of error when q = 364 and K = 1? What is the 
probability of error when q = 364 and K is large, e.g. K = 6000? 



► 9.9 Solutions 

Solution to exercise 9.2 (p. 149). If we assume we observe y = 0, 

P(x = l\y = 0) - ^ = 0^=1)^= 1) (9 25) 



0.15 x 0.1 



(9.26) 



0.15 x 0.1 + 0.85 x 0.9 

= ^4r^ = 0.019. (9.27) 

0.78 v ; 

Solution to exercise 9.4 (p. 149). If we observe y = 0, 

P(x=l|y = 0) = °- 15X °- 1 (9.28) 

v ly ' 0.15x0.1 + 1.0x0.9 v ; 

0.015 

= 0915 = °' 016 - (9 - 29) 

Solution to exercise 9.7 (p. 150). The probability that y = 1 is 0.5, so the 
mutual information is: 

I(X;Y) = H(Y) - H(Y j X) (9.30) 

= # 2 (0.5) - # 2 (0.15) (9.31) 

= l - 0.61 = 0.39 bits. (9.32) 



Solution to exercise 9.8 (p. 150). We again compute the mutual information 
using I(X; Y) = H(Y) - H(Y | X). The probability that y = is 0.575, and 
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H(Y\X) = Y JX P{x)H{Y\x) = P(x=l)H(Y\x=l) + P(x = 0)H(Y | x = 0) 
so the mutual information is: 

I(X;Y) = H{Y) - H(Y j X) (9.33) 
= ff 2 (0.575) - [0.5 x # 2 (0.15) + 0.5 x 0] (9.34) 
= 0.98 - 0.30 = 0.679 bits. (9.35) 

Solution to exercise 9.12 (p. 151). By symmetry, the optimal input distribution 
is {0.5,0.5}. Then the capacity is 

C = I{X-Y) = H{Y)-H{Y\X) (9.36) 
= # 2 (0.5) - H 2 (f) (9.37) 
= l-H 2 (f). (9.38) 

Would you like to find the optimal input distribution without invoking sym- 
metry? We can do this by computing the mutual information in the general 
case where the input ensemble is {po, P i}: 

I{X-Y) = H(Y) — H(Y | X) (9.39) 
= H 2 ( P0 f + Pl (l-f))-H 2 (f). (9.40) 

The only p-dependence is in the first term H 2 (pof + Pi(l — /)), which is 
maximized by setting the argument to 0.5. This value is given by setting 
Po = 1/2. 



Solution to exercise 9.13 (p. 151). Answer 1. By symmetry, the optimal input 
distribution is {0.5, 0.5}. The capacity is most easily evaluated by writing the 
mutual information as I(X; Y) = H(X) — H(X \ Y). The conditional entropy 
H (X | Y) is P(y)H(X \ y); when y is known, x is only uncertain if y = ?, 
which occurs with probability / /2 + //2, so the conditional entropy H(X \ Y) 
is /ff 2 (0.5). 

C = I{X-Y) = H(X)-H(X\Y) (9.41) 
= # 2 (0.5) - fH 2 {0.5) (9.42) 
= l-f- (9-43) 

The binary erasure channel fails a fraction / of the time. Its capacity is 
precisely 1 — /, which is the fraction of the time that the channel is reliable. 
This result seems very reasonable, but it is far from obvious how to encode 
information so as to communicate reliably over this channel. 

Answer 2. Alternatively, without invoking the symmetry assumed above, we 
can start from the input ensemble { P o,pi}. The probability that y = ? is 
Pof + Pif = /, and when we receive y = ?, the posterior probability of x is 
the same as the prior probability, so: 

I(X;Y) = H(X) - H(X\Y) (9.44) 
= H 2 ( P1 )- fH 2 ( Pl ) (9.45) 
= (l-f)H 2 ( Pl ). (9.46) 



This mutual information achieves its maximum value of (1 — /) whenpi = 1/2. 
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Figure 9.10. (a) The extended 
channel (N = 2) obtained from a 
binary erasure channel with 
erasure probability 0.15. (b) A 
block code consisting of the two 
codewords 00 and 11. (c) The 
optimal decoder for this code. 



Solution to exercise 9.14 (p. 153). The extended channel is shown in fig- 
ure 9.10. The best code for this channel with N = 2 is obtained by choosing 
two columns that have minimal overlap, for example, columns 00 and 11. The 
decoding algorithm returns '00' if the extended channel output is among the 
top four and '11' if it's among the bottom four, and gives up if the output is 
'??'. 

Solution to exercise 9.15 (p. 155). In example 9.11 (p. 151) we showed that the 
mutual information between input and output of the Z channel is 



I(X;Y) = H(Y)-H(Y\X) 

= H 2 ( Pl (l-f))- Pl H 2 (f). 



(9.47) 



We differentiate this expression with respect to pi, taking care not to confuse 
log 2 with log e : 



dp 



■/(x ; y) = (i-/)iog 2 



i-pi(i-/) 
pi(i-/) 



H 2 (f)- 



(9.48) 



Setting this derivative to zero and rearranging using skills developed in exer- 
cise 2.17 (p. 36), we obtain: 



so the optimal input distribution is 



i + 2 #2(/)/(i-/) • 



Pi 



1/(1-/) 



. 2 (#2(/)/(l-/)) ' 



(9.49) 



(9.50) 



As the noise level / tends to 1, this expression tends to 1/e (as you can prove 
using L'Hopital's rule). 

For all values of /, p\ is smaller than 1/2. A rough intuition for why input 1 
is used less than input is that when input 1 is used, the noisy channel injects 
entropy into the received string; whereas when input is used, the noise has 
zero entropy. Thus starting from p\ = 1/2, a perturbation towards smaller 
Pi will reduce the conditional entropy H(Y \ X) linearly while leaving H(Y) 
unchanged, to first order. H(Y) decreases only quadratically in (pi — V2). 

Solution to exercise 9.16 (p. 155). The capacities of the three channels are 
shown in figure 9.11. For any / < 0.5, the BEC is the channel with highest 
capacity and the BSC the lowest. 



Solution to exercise 9.18 (p. 155). 

ratio, given y, is 



The logarithm of the posterior probability 



a(y) = In 



P(x = l\y,a,a) 



In 



Q(y \x=l,a,a) 



ay 



P(x = - 1 1 y, a, a) Q(y \ x = - 1, a, a) <j 2 



(9.51) 




0.1 0.2 0.3 OA 0.5 0.6 0.7 0.8 0.9 1 



Figure 9.11. Capacities of the Z 
channel, binary symmetric 
channel, and binary erasure 
channel. 
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160 9 — Communication over a Noisy Channel 

Using our skills picked up from exercise 2.17 (p. 36), we rewrite this in the 
form 

p ( x=1 \y> a >°) = rth®- (9 - 52) 

The optimal decoder selects the most probable hypothesis; this can be done 
simply by looking at the sign of a(y). If a(y) > then decode as x = 1. 
The probability of error is 



Pb 



f° f~ xa 1 _ y2 / xa\ 

/ dyQ(y\x = l,a,a) = / dy — ==e 5^ = $ . (9.53) 

J-oo J-oo V27TC7 2 V O- / 



Random coding 

Solution to exercise 9.20 (p. 156). The probability that S = 24 people whose 
birthdays are drawn at random from A = 365 days all have distinct birthdays 

A(A-l)(A-2y..(A-S+l) ^ (9 54) 

The probability that two (or more) people share a birthday is one minus this 
quantity which, for S = 24 and A = 365, is about 0.5. This exact way of 
answering the question is not very informative since it is not clear for what 
value of S the probability changes from being close to to being close to 1. 

The number of pairs is S(S — l)/2, and the probability that a particular 
pair shares a birthday is I /A, so the expected number of collisions is 

This answer is more instructive. The expected number of collisions is tiny if 
S < \[A and big if S > \/~A. 

We can also approximate the probability that all birthdays are distinct, 
for small S, thus: 

A(A-1)(A-2)„.(A-S+1) = {1){1 _ 1/A){1 _2 /A) ^ {1 _(S-i)/ A) 

~ exp(0)exp(-l/A)exp(-2/A)...exp(-(S'-l)/A) (9.56) 
. ,xp(--y/)=e X pf- ^- 1)/2 y (9.57) 
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About Chapter 10 



Before reading Chapter 10, you should have read Chapters 4 and 9. Exer- 
cise 9.14 (p. 153) is especially recommended. 



Cast of characters 
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the noisy channel 


c 




the capacity of the channel 


X N 




an ensemble used to create a random code 


c 




a random code 


N 




the length of the codewords 


xW 




a codeword, the sth in the code 


s 




the number of a chosen codeword (mnemonic: the source 
selects s) 


S = 


2 K 


the total number of codewords in the code 


K = 


: l0g 2 S 


the number of bits conveyed by the choice of one codeword 
from S, assuming it is chosen with uniform probability 


s 




a binary representation of the number s 


R = 


K/N 


the rate of the code, in bits per channel use (sometimes called 
R' instead) 


s 




the decoder's guess of s 
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10 



The Noisy-Channel Coding Theorem 



► 10.1 The theorem 



The theorem has three parts, two positive and one negative. The main positive 
result is the first. 

1. For every discrete memoryless channel, the channel capacity 

C = max I(X;Y) (10.1) 

Vx 

has the following property. For any e > and R < C, for large enough N , 
there exists a code of length N and rate > R and a decoding algorithm, 
such that the maximal probability of block error is < e. 

2. If a probability of bit error pb is acceptable, rates up to R{pb) are achiev- 
able, where 

Rin) = rnkd- (10 ' 2) 

3. For any p^, rates greater than R(pt,) are not achievable. 



Pb 





R(Pb) 


1 


2 / 

/ 3 



c 



R 



Figure 10.1. Portion of the R,Pb 
plane to be proved achievable 
(1,2) and not achievable (3). 



► 10.2 Jointly-typical sequences 

We formalize the intuitive preview of the last chapter. 

We will define codewords as coming from an ensemble X N , and con- 
sider the random selection of one codeword and a corresponding channel out- 
put y, thus defining a joint ensemble (XY) N . We will use a typical-set decoder, 
which decodes a received signal y as s if xW and y are jointly typical, a term 
to be defined shortly. 

The proof will then centre on determining the probabilities (a) that the 
true input codeword is not jointly typical with the output sequence; and (b) 
that a false input codeword is jointly typical with the output. We will show 
that, for large N, both probabilities go to zero as long as there are fewer than 
2 NC codewords, and the ensemble X is the optimal input distribution. 

Joint typicality. A pair of sequences x, y of length N are defined to be 
jointly typical (to tolerance /?) with respect to the distribution P(x,y) 
if 



x is typical of P(x), i.e., 
y is typical of P(y), i.e., 
and x,y is typical of P(x,y), i.e., 



1 , 1 
1 , 1 



N °P(y) 
■log 



H(X) 
H(Y) 



</?, 



N °P(x,y) 



H(X,Y) 



<(3. 
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10.2: Jointly-typical sequences 

The jointly-typical set J N p is the set of all jointly typical sequence pairs 
of length N. 

Example. Here is a jointly- typical pair of length N = 100 for the ensemble 
P(x,y) in which P(x) has {po,p\) = (0.9,0.1) and P(y \ x) corresponds to a 
binary symmetric channel with noise level 0.2. 

X 1111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 
y 0011111111000000000000000000000000000000000000000000000000000000000000000000000000111111111111111111 

Notice that x has 10 Is, and so is typical of the probability -P(x) (at any 
tolerance j3); and y has 26 Is, so it is typical of P(y) (because P(y= 1) = 0.26); 
and x and y differ in 20 bits, which is the typical number of flips for this 
channel. 

Joint typicality theorem. Let x, y be drawn from the ensemble (XY) N 
defined by 

N 

P(x,y) = Y[P(x n ,y n ). 

n=l 

Then 

1. the probability that x,y are jointly typical (to tolerance (3) tends 
to 1 as N — > oo; 

2. the number of jointly-typical sequences \ Jn@\ is close to 2 NH ( X,Y >. 
To be precise, 

\J N(3 \ < 2 ^PW+«; (10.3) 

3. if x' ~ X N and y' ~ Y N , i.e., x' and y' are independent samples 
with the same marginal distribution as -P(x, y), then the probability 
that (x',y') lands in the jointly-typical set is about 2~ NI ( X;Y \ To 
be precise, 

P((x',y') e J N p) < 2- N ^ x ^-^ . (10.4) 

Proof. The proof of parts 1 and 2 by the law of large numbers follows that 
of the source coding theorem in Chapter 4. For part 2, let the pair x,y 
play the role of x in the source coding theorem, replacing P{x) there by 
the probability distribution P(x,y). 

For the third part, 

P((x',y')e Jnp) = E p (x)P(y) (10-5) 

(*,y)e Jn/3 

< \J Np \2- Ni - H ^-^2- N ^ Y ^ (10.6) 

< 2 N{H{X,Y)+f))-N(H{X)+H{Y)-2f)) 

= 2 - N ^ x > Y ^\ □ (10.8) 

A cartoon of the jointly-typical set is shown in figure 10.2. Two independent 
typical vectors are jointly typical with probability 

P((xV)e4^2-"( ; ^ 5 ')) (10.9) 

because the total number of independent typical pairs is the area of the dashed 
rectangle, 2 NH ( X ^2 NH ( Y \ and the number of jointly- typical pairs is roughly 
2NH(x,Y) ; so £ ne probability of hitting a jointly-typical pair is roughly 

2 NH(X,Y) i 2 NH(X)+NH(Y) = 2 -NI(X;Y)_ (10.10) 
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2 NH(X) 



2 NH(X,Y) dQts 



2 NH(Y\X) I 



2NH(Y) 



2 NH(X\Y)' 



Figure 10.2. The jointly-typical 
set. The horizontal direction 
represents Ax, the set of all input 
strings of length N. The vertical 
direction represents Ay, the set of 
all output strings of length N. 
The outer box contains all 
conceivable input-output pairs. 
Each dot represents a 
jointly-typical pair of sequences 
(x, y). The total number of 
jointly-typical sequences is about 

2 NH(X,Y) 



► 10.3 Proof of the noisy-channel coding theorem 

Analogy 

Imagine that we wish to prove that there is a baby in a class of one hundred 
babies who weighs less than 10 kg. Individual babies are difficult to catch and 
weigh. Shannon's method of solving the task is to scoop up all the babies 
and weigh them all at once on a big weighing machine. If we find that their 
average weight is smaller than 10 kg, there must exist at least one baby who 
weighs less than 10 kg - indeed there must be many! Shannon's method isn't 
guaranteed to reveal the existence of an underweight child, since it relies on 
there being a tiny number of elephants in the class. But if we use his method 
and get a total weight smaller than 1000 kg then our task is solved. 




Figure 10.3. Shannon's method for 
proving one baby weighs less than 
10 kg. 



From skinny children to fantastic codes 

We wish to show that there exists a code and a decoder having small prob- 
ability of error. Evaluating the probability of error of any particular coding 
and decoding system is not easy. Shannon's innovation was this: instead of 
constructing a good coding and decoding system and evaluating its error prob- 
ability, Shannon calculated the average probability of block error of all codes, 
and proved that this average is small. There must then exist individual codes 
that have small probability of block error. 

Random coding and typical-set decoding 

Consider the following encoding-decoding system, whose rate is R' . 

1. We fix P(x) and generate the S = 2 NR codewords of a (N,NR r ) = 
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x (3) x (l) x( 2 >X< 4 > x( 3 )xW X< 2 'x< 4 ) 

Ya 

Yb 



Yd 
Yc 



s(y a ) = o 

s(yb) = 3 



s(y d ) = 

s(y c )=4 
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Figure 10.4. (a) A random code, 
(b) Example decodings by the 
typical set decoder. A sequence 
that is not jointly typical with any 
of the codewords, such as y a , is 
decoded as s = 0. A sequence that 
is jointly typical with codeword 
x' li ' alone, y&, is decoded as s = 3. 
Similarly, y c is decoded as s = 4. 
A sequence that is jointly typical 
with more than one codeword, 
such as yd, is decoded as s = 0. 



(a) 



(b) 



(N, K) code C at random according to 

N 

P(x) = l[P(x n ). (10.11) 

n=l 

A random code is shown schematically in figure 10.4a. 

2. The code is known to both sender and receiver. 

3. A message s is chosen from {1,2,..., 2 NR '}, and x^ s ' is transmitted. The 
received signal is y, with 

N 

P(y|x< s ))= \{P(yn\x^). (10.12) 

71=1 

4. The signal is decoded by typical-set decoding. 

Typical-set decoding. Decode y as s if (x( s ),y) are jointly typical and 
there is no other s' such that (x' s '',y) are jointly typical; 
otherwise declare a failure (s = 0). 

This is not the optimal decoding algorithm, but it will be good enough, 
and easier to analyze. The typical-set decoder is illustrated in fig- 
ure 10.4b. 

5. A decoding error occurs if s ^ s. 

There are three probabilities of error that we can distinguish. First, there 
is the probability of block error for a particular code C, that is, 

PB (C) = P{s^s\C). (10.13) 

This is a difficult quantity to evaluate for any given code. 

Second, there is the average over all codes of this block error probability, 

(p B )=E P (^ s l C ) P ( C )- ( 10 - 14 ) 
c 
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Fortunately, this quantity is much easier to evaluate than the first quantity 
P(s ^s\C). 

Third, the maximal block error probability of a code C, 



Pbm(C) = maxP(s ^ s \ s,C), 



(10.15) 



is the quantity we are most interested in: we wish to show that there exists a 
code C with the required rate whose maximal block error probability is small. 

We will get to this result by first finding the average block error probability, 
(pb)- Once we have shown that this can be made smaller than a desired small 
number, we immediately deduce that there must exist at least one code C 
whose block error probability is also less than this small number. Finally, 
we show that this code, whose block error probability is satisfactorily small 
but whose maximal block error probability is unknown (and could conceivably 
be enormous), can be modified to make a code of slightly smaller rate whose 
maximal block error probability is also guaranteed to be small. We modify 
the code by throwing away the worst 50% of its codewords. 

We therefore now embark on finding the average probability of block error. 



(Pb) is just the probability that 
there is a decoding error at step 5 
of the process on the previous 
page. 



Probability of error of typical-set decoder 

There are two sources of error when we use typical-set decoding. Either (a) 
the output y is not jointly typical with the transmitted codeword x^, or (b) 
there is some other codeword in C that is jointly typical with y. 

By the symmetry of the code construction, the average probability of error 
averaged over all codes does not depend on the selected value of s; we can 
assume without loss of generality that s = 1. 

(a) The probability that the input xW and the output y are not jointly 
typical vanishes, by the joint typicality theorem's first part (p. 163). We give a 
name, 6, to the upper bound on this probability, satisfying 6 — > as N — > oo; 
for any desired 6, we can find a blocklength N(6) such that the P((xW,y) 
Jnp) < S. 

(b) The probability that x^ s ) and y are jointly typical, for a given s' ^ 1 
is < 2- Ar ( / ( x ; y )- 3 ' 9 ), by part 3. And there are (2^' - 1) rival values of s' to 
worry about. 

Thus the average probability of error {pb) satisfies: 

2 NR ' 

(Pb) < 5+J2^ N{nX '' Yy3f>) (10-16) 

s'=2 

< 5 + 2-ZV(/(X;Y)-fl'-3/3)_ (1(U7) 

The inequality (10.16) that bounds a total probability of error Ptot by the 
sum of the probabilities P s i of all sorts of events s' each of which is sufficient 
to cause error, 

P TOT < P + P + • • • , 

is called a union bound. It is only an equality if the different events that cause 
error never occur at the same time as each other. 

The average probability of error (10.17) can be made < 26 by increasing N if 

R' < I(X; Y) - 3/3. (10.18) 

We are almost there. We make three modifications: 



1. We choose P(x) in the proof to be the optimal input distribution of the 
channel. Then the condition R' < I(X; Y) — 3/3 becomes R' < C — 3/3. 
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(a) A random code 



(b) expurgated 



2. Since the average probability of error over all codes is < 26, there must 
exist a code with mean probability of block error pb(C) < 26. 



Figure 10.5. How expurgation 
works, (a) In a typical random 
code, a small fraction of the 
codewords are involved in 
collisions - pairs of codewords are 
sufficiently close to each other 
that the probability of error when 
cither codeword is transmitted is 
not tiny. We obtain a new code 
from a random code by deleting 
all these confusable codewords, 
(b) The resulting code has slightly 
fewer codewords, so has a slightly 
lower rate, and its maximal 
probability of error is greatly 
reduced. 



3. To show that not only the average but also the maximal probability of 
error, pbm, can be made small, we modify this code by throwing away 
the worst half of the codewords - the ones most likely to produce errors. 
Those that remain must all have conditional probability of error less 
than 4(5. We use these remaining codewords to define a new code. This 
new code has 2 NR _1 codewords, i.e., we have reduced the rate from R' 
to R' — Vn (a negligible reduction, if N is large), and achieved Pbm < 4(5. 
This trick is called expurgation (figure 10.5). The resulting code may 
not be the best code of its rate and length, but it is still good enough to 
prove the noisy-channel coding theorem, which is what we are trying to 
do here. 

In conclusion, we can 'construct' a code of rate R' — ViV, where R' < C — 3/3, 
with maximal probability of error < 4(5. We obtain the theorem as stated by 
setting R' = (R+ C)/2, 6 = e/4, (3 < (C - R')/3, and N sufficiently large for 
the remaining conditions to hold. The theorem's first part is thus proved. □ 

► 10.4 Communication (with errors) above capacity 

We have proved, for any discrete memoryless channel, the achievability of a 
portion of the R,pt, plane shown in figure 10.6. We have shown that we can 
turn any noisy channel into an essentially noiseless binary channel with rate 
up to C bits per cycle. We now extend the right-hand boundary of the region 
of achievability at non-zero error probabilities. [This is called rate-distortion 
theory.} 

We do this with a new trick. Since we know we can make the noisy channel 
into a perfect channel with a smaller rate, it is sufficient to consider commu- 
nication with errors over a noiseless channel. How fast can we communicate 
over a noiseless channel, if we are allowed to make errors? 

Consider a noiseless binary channel, and assume that we force communi- 
cation at a rate greater than its capacity of 1 bit. For example, if we require 
the sender to attempt to communicate at R = 2 bits per cycle then he must 
effectively throw away half of the information. What is the best way to do 
this if the aim is to achieve the smallest possible probability of bit error? One 
simple strategy is to communicate a fraction 1 / R of the source bits, and ignore 
the rest. The receiver guesses the missing fraction 1 — 1/R at random, and 



Pb 



achievable 



C R 

Figure 10.6. Portion of the R,Pb 
plane proved achievable in the 
first part of the theorem. [We've 
proved that the maximal 
probability of block error pbm can 
be made arbitrarily small, so the 
same goes for the bit error 
probability p b , which must be 
smaller than pbm-] 
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the average probability of bit error is 

Pb = l(i - w 



(10.19) 



The curve corresponding to this strategy is shown by the dashed line in fig- 
ure 10.7. 

We can do better than this (in terms of minimizing pb) by spreading out 
the risk of corruption evenly among all the bits. In fact, we can achieve 
Pb = H^ l {l — l/R), which is shown by the solid curve in figure 10.7. So, how 
can this optimum be achieved? 

We reuse a tool that we just developed, namely the (N, K) code for a 
noisy channel, and we turn it on its head, using the decoder to define a lossy 
compressor. Specifically, we take an excellent (N, K) code for the binary 
symmetric channel. Assume that such a code has a rate R' = K/N , and that 
it is capable of correcting errors introduced by a binary symmetric channel 
whose transition probability is q. Asymptotically, r&te-R' codes exist that 
have R' ~ 1 — H2(q). Recall that, if we attach one of these capacity-achieving 
codes of length N to a binary symmetric channel then (a) the probability 
distribution over the outputs is close to uniform, since the entropy of the 
output is equal to the entropy of the source (NR') plus the entropy of the 
noise (NH2{q)), and (b) the optimal decoder of the code, in this situation, 
typically maps a received vector of length A to a transmitted vector differing 
in qN bits from the received vector. 

We take the signal that we wish to send, and chop it into blocks of length N 
(yes, A, not K). We pass each block through the decoder, and obtain a shorter 
signal of length K bits, which we communicate over the noiseless channel. To 
decode the transmission, we pass the K bit message to the encoder of the 
original code. The reconstituted message will now differ from the original 
message in some of its bits - typically qN of them. So the probability of bit 
error will be pb = q. The rate of this lossy compressor is R = N/K = l/R' = 
\/{l-H 2 {p h )). 

Now, attaching this lossy compressor to our capacity-C error-free commu- 
nicator, we have proved the achievability of communication up to the curve 
(pb, R) defined by: 

*=Tdk? D (ia20 » 

For further reading about rate-distortion theory, see Gallager (1968), p. 451, 
or McEliece (2002), p. 75. 




Figure 10.7. A simple bound on 
achievable points (R,pb), and 
Shannon's bound. 



► 10.5 The non-achievable region (part 3 of the theorem) 

The source, encoder, noisy channel and decoder define a Markov chain: s — > x — > y — > s 

P(s, x, y, s) = P(s)P(x | s)P(y | x)P(s j y). (10.21) 

The data processing inequality (exercise 8.9, p. 141) must apply to this chain: 
I{s', s) < /(x; y). Furthermore, by the definition of channel capacity, /(x; y) < 
NC, so I(s; s) < NC. 

Assume that a system achieves a rate R and a bit error probability p^ 
then the mutual information I(s; s) is > NR(l — -£/2(.Pb))- But I(s; s) > NC 
is not achievable, so R > — — r is not achievable. □ 

Exercise 10.1. ^ Fill in the details in the preceding argument. If the bit errors 
between s and s are independent then we have I(s; s) = NR(1 — i?2(Pb))- 
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What if we have complex correlations among those bit errors? Why does 
the inequality I(s; s) > NR(1 - H 2 {pb)) hold? 



► 10.6 Computing capacity 

We have proved that the capacity of a channel is the maximum rate at which Sections 10.6-10.8 contain 



reliable communication can be achieved. How can we compute the capacity of 
a given discrete memoryless channel? We need to find its optimal input distri- 
bution. In general we can find the optimal input distribution by a computer 
search, making use of the derivative of the mutual information with respect 
to the input probabilities. 

[> Exercise 10.2. ^ Find the derivative of I(X; Y) with respect to the input prob- 
ability pi, dI(X; Y)/dpi, for a channel with conditional probabilities Qj\i- 



advanced material. The first-time 
reader is encouraged to skip to 
section 10.9 (p. 172). 



Exercise 10.3. ^ Show that I(X; Y) is a concave - 
ability vector p. 



function of the input prob- 



Since I(X; Y) is concave-—- in the input distribution p, any probability distri- 
bution p at which I(X; Y) is stationary must be a global maximum of I(X; Y). 
So it is tempting to put the derivative of I(X; Y) into a routine that finds a 
local maximum of I(X; Y), that is, an input distribution P(x) such that 



dI(X;Y) 
dpi 



= X for all i, 



(10.22) 



where A is a Lagrange multiplier associated with the constraint J2iPi = 1- 
However, this approach may fail to find the right answer, because I(X;Y) 
might be maximized by a distribution that has pi = for some inputs. A 
simple example is given by the ternary confusion channel. 

Ternary confusion channel. „4 X = {0, ?,!}• Ay = {0, !}• 



P(y- 



■ Q\x- 

■ l\x-- 



= 0) 
= 0) 



P(y- 



-Q\X- 

■ l\x-- 



?) 
?) 



1/2; 
1/2; 



P(y- 
P{y- 



■ti\x- 
■ l\x-- 



:1) 
:1) 



0; 
1. 



Whenever the input ? is used, the output is random; the other inputs 
are reliable inputs. The maximum information rate of 1 bit is achieved 
by making no use of the input ?. 

l> Exercise 10.4.^' P- 173 l Sketch the mutual information for this channel as a 
function of the input distribution p. Pick a convenient two-dimensional 
representation of p. 



The optimization routine must therefore take account of the possibility that, 
as we go up hill on I(X; Y), we may run into the inequality constraints pi > 0. 

> Exercise 10.5. p,174 l Describe the condition, similar to equation (10.22), that 
is satisfied at a point where I(X; Y) is maximized, and describe a com- 
puter program for finding the capacity of a channel. 
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Results that may help in finding the optimal input distribution 

1. All outputs must be used. 

2. I(X,Y) is a convex-—- function of the channel parameters. 

3. There may be several optimal input distributions, but they all look the 
same at the output. 

> Exercise 10.6. ^ Prove that no output y is unused by an optimal input distri- 
bution, unless it is unreachable, that is, has Q(y \ x) = for all x. 

bxercise 10.7. [2] Prove that I(X,Y) is a convex -~ ' function of Q(y \ x). 

Exercise 10.8.^ Prove that all optimal input distributions of a channel have 
the same output probability distribution P(y) = J2x P( x )Q(y I x )- 

These results, along with the fact that I(X; Y) is a concave ^— • function of 
the input probability vector p, prove the validity of the symmetry argument 
that we have used when finding the capacity of symmetric channels. If a 
channel is invariant under a group of symmetry operations - for example, 
interchanging the input symbols and interchanging the output symbols - then, 
given any optimal input distribution that is not symmetric, i.e., is not invariant 
under these operations, we can create another input distribution by averaging 
together this optimal input distribution and all its permuted forms that we 
can make by applying the symmetry operations to the original optimal input 
distribution. The permuted distributions must have the same I(X; Y) as the 
original, by symmetry, so the new input distribution created by averaging 
must have I(X; Y) bigger than or equal to that of the original distribution, 
because of the concavity of I. 

Symmetric channels 

In order to use symmetry arguments, it will help to have a definition of a 
symmetric channel. I like Gallager's (1968) definition. 

A discrete memoryless channel is a symmetric channel if the set of 

outputs can be partitioned into subsets in such a way that for each 
subset the matrix of transition probabilities has the property that each 
row (if more than 1) is a permutation of each other row and each column 
is a permutation of each other column. 

Example 10.9. This channel 



is a symmetric channel because its outputs can be partitioned into (0, 1) 
and ?, so that the matrix can be rewritten: 



Reminder: The term 'convex-—-' 
means 'convex', and the term 
'concave--^' means 'concave'; the 
little smile and frown symbols are 
included simply to remind you 
what convex and concave mean. 



P( y = 0\x=0) 
P(y = ?\x=0) 
P(y=l\ X = 0) 



0.7; P(y = 
0.2; P(y = 
0.1; P(y = 



0|x = l) = 0.1; 
?\x = l) = 0.2; 
l|x = l) = 0.7. 



(10.23) 



P( y = 0\x = 0) = 0.7; P{y = Q\x = l) = 0.1; 
P( y =l\x = 0) = 0.1; P(y=l\x=l) = 0.7; 



(10.24) 



P( y = ?\x = 0) = 0.2; P(y = ?\ X =l) = 0.2. 
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10.7: Other coding theorems 

Symmetry is a useful property because, as we will see in a later chapter, 
communication at capacity can be achieved over symmetric channels by linear 
codes. 

Exercise 10.10. ^ Prove that for a symmetric channel with any number of 
inputs, the uniform distribution over the inputs is an optimal input 
distribution. 

t> Exercise 10.11. ^' p ' 174 l Are there channels that are not symmetric whose op- 
timal input distributions are uniform? Find one, or prove there are 
none. 

► 10.7 Other coding theorems 

The noisy-channel coding theorem that we proved in this chapter is quite gen- 
eral, applying to any discrete memoryless channel; but it is not very specific. 
The theorem only says that reliable communication with error probability e 
and rate R can be achieved by using codes with sufficiently large blocklength 
N. The theorem does not say how large N needs to be to achieve given values 
of R and e. 

Presumably, the smaller e is and the closer R is to C, the larger N has to 

be. 

Noisy-channel coding theorem - version with explicit N -dependence 

For a discrete memoryless channel, a blocklength N and a rate R, 
there exist block codes of length N whose average probability of 
error satisfies: 

Pb < exp [-NE T (R)] (10.25) 

where E T (R) is the random-coding exponent of the channel, a 
convex---, decreasing, positive function of R for < R < C. The 
random-coding exponent is also known as the reliability function. 

[By an expurgation argument it can also be shown that there exist 
block codes for which the maximal probability of error pbm is also 
exponentially small in N.} 

The definition of E T (R) is given in Gallager (1968), p. 139. E T (R) approaches 
zero as R — > C; the typical behaviour of this function is illustrated in fig- 
ure 10.8. The computation of the random-coding exponent for interesting 
channels is a challenging task on which much effort has been expended. Even 
for simple channels like the binary symmetric channel, there is no simple ex- 
pression for E T (R). 

Lower bounds on the error probability as a function of blocklength 

The theorem stated above asserts that there are codes with pb smaller than 
exp [— NE T (R)]. But how small can the error probability be? Could it be 
much smaller? 

For any code with blocklength N on a discrete memoryless channel, 
the probability of error assuming all source messages are used with 
equal probability satisfies 




R 



Figure 10.8. A typical 
random-coding exponent. 



PB >exp[-NE sp (R)}, 



(10.26) 
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where the function E sp (R), the sphere-packing exponent of the 
channel, is a convex^--, decreasing, positive function of R for < 
R< C. 

For a precise statement of this result and further references, see Gallager 
(1968), p. 157. 



► 10.8 Noisy-channel coding theorems and coding practice 

Imagine a customer who wants to buy an error-correcting code and decoder 
for a noisy channel. The results described above allow us to offer the following 
service: if he tells us the properties of his channel, the desired rate R and the 
desired error probability pb, we can, after working out the relevant functions 
C, Ej-(R), and E sp (R), advise him that there exists a solution to his problem 
using a particular blocklength N; indeed that almost any randomly chosen 
code with that blocklength should do the job. Unfortunately we have not 
found out how to implement these encoders and decoders in practice; the cost 
of implementing the encoder and decoder for a random code with large iV 
would be exponentially large in N. 

Furthermore, for practical purposes, the customer is unlikely to know ex- 
actly what channel he is dealing with. So Berlekamp (1980) suggests that 
the sensible way to approach error-correction is to design encoding-decoding 
systems and plot their performance on a variety of idealized channels as a 
function of the channel's noise level. These charts (one of which is illustrated 
on page 568) can then be shown to the customer, who can choose among the 
systems on offer without having to specify what he really thinks his channel 
is like. With this attitude to the practical problem, the importance of the 
functions E T (R) and E sp (R) is diminished. 



► 10.9 Further exercises 

Exercise 10.12.^ A binary erasure channel with input x and output y has 
transition probability matrix: 



Q 



l-q 

q 
o 



1^*1 



Find the mutual information I(X; Y) between the input and output for 
general input distribution {jJrjj.Pi}: an d show that the capacity of this 
channel is C = 1 — q bits. 



A Z channel has transition probability matrix: 

Q = [ 1 q 1 7° 

Show that, using a (2, 1) code, two uses of a Z channel can be made to 
emulate one use of an erasure channel, and state the erasure probability 
of that erasure channel. Hence show that the capacity of the Z channel, 
Cz, satisfies Cz > — q) bits. 

Explain why the result Cz > 5(1 — q) is an inequality rather than an 
equality. 
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10.10: Solutions 



Exercise lO.lS.^' P' 1 ^ A transatlantic cable contains N = 20 indistinguish- 
able electrical wires. You have the job of figuring out which wire is 
which, that is, to create a consistent labelling of the wires at each end. 
Your only tools are the ability to connect wires to each other in groups 
of two or more, and to test for connectedness with a continuity tester. 
What is the smallest number of transatlantic trips you need to make, 
and how do you do it? 

How would you solve the problem for larger N such as N = 1000? 

As an illustration, if N were 3 then the task can be solved in two steps 
by labelling one wire at one end a, connecting the other two together, 
crossing the Atlantic, measuring which two wires are connected, labelling 
them b and c and the unconnected one o, then connecting 6 to a and 
returning across the Atlantic, whereupon on disconnecting b from c, the 
identities of b and c can be deduced. 

This problem can be solved by persistent search, but the reason it is 
posed in this chapter is that it can also be solved by a greedy approach 
based on maximizing the acquired information. Let the unknown per- 
mutation of wires be x. Having chosen a set of connections of wires C at 
one end, you can then make measurements at the other end, and these 
measurements y convey information about x. How much? And for what 
set of connections is the information that y conveys about x maximized? 



► 10.10 Solutions 

Solution to exercise 10.4 (p. 169). If the input distribution is p = {po,P?,Pi), 
the mutual information is 

I(X; Y) = H(Y) - H{Y\X) = H 2 ( Po + p ? /2) - p ? . (10.27) 

We can build a good sketch of this function in two ways: by careful inspection 
of the function, or by looking at special cases. 

For the plots, the two-dimensional representation of p I will use has po and 
pi as the independent variables, so that p = (po,P?,Pi) = {po, {l—Po—Pi),Pi)- 




By inspection. If we use the quantities p* = po + p?/2 and p-> as our two 
degrees of freedom, the mutual information becomes very simple: I(X; Y) = 
H2{p*) — P?- Converting back to po = p* — p?/2 and p\ = 1 — — p?/2, 
we obtain the sketch shown at the left below. This function is like a tunnel 
rising up the direction of increasing po and p\ . To obtain the required plot of 
I(X; Y) we have to strip away the parts of this tunnel that live outside the 
feasible simplex of probabilities; we do this by redrawing the surface, showing 
only the parts where po > and p\ > 0. A full plot of the function is shown 
at the right. 
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Special cases. In the special case p? = 0, the channel is a noiseless binary 
channel, and I(X;Y) = ^(po)- 

In the special case po = pi, the term i?2(po + P?/2) is equal to 1, so 
I(X;Y) = l-p 7 . 

In the special case po = 0, the channel is a Z channel with error probability 
0.5. We know how to sketch that, from the previous chapter (figure 9.3). 
These special cases allow us to construct the skeleton shown in figure 10.9. 



Solution to exercise 10.5 (p. 169). 

maximize I(X; Y) are 



Necessary and sufficient conditions for p to 



dl(X-Y) 

dpi 
dl(X-Y) 

dpi 



< 




for all 



(10.28) 



where A is a constant related to the capacity by C = A + log 2 e. 

This result can be used in a computer program that evaluates the deriva- 
tives, and increments and decrements the probabilities pi in proportion to the 
differences between those derivatives. 

This result is also useful for lazy human capacity-finders who are good 
guessers. Having guessed the optimal input distribution, one can simply con- 
firm that equation (10.28) holds. 

Solution to exercise 10.11 (p. 171). We certainly expect nonsymmetric chan- 
nels with uniform optimal input distributions to exist, since when inventing a 
channel we have I (J — 1) degrees of freedom whereas the optimal input dis- 
tribution is just (/ — l)-dimensional; so in the /( J— l)-dimensional space of 
perturbations around a symmetric channel, we expect there to be a subspace 
of perturbations of dimension I{J — 1) — (I — 1) = I(J — 2) + 1 that leave the 
optimal input distribution unchanged. 

Here is an explicit example, a bit like a Z channel. 



Q = 



0.9585 


0.0415 


0.35 


0.0 


0.0415 


0.9585 


0.0 


0.35 








0.65 














0.65 



(10.29) 



Solution to exercise 10.13 (p. 173). The labelling problem can be solved for 
any TV > 2 with just two trips, one each way across the Atlantic. 

The key step in the information-theoretic approach to this problem is to 
write down the information content of one partition, the combinatorial object 
that is the connecting together of subsets of wires. If TV wires are grouped 
together into g\ subsets of size 1, 52 subsets of size 2, . . . , then the number of 
such partitions is 

TV' 

n=— , (10.30) 



fh 



and the information content of one such partition is the log of this quantity. 
In a greedy strategy we choose the first partition to maximize this information 
content. 

One game we can play is to maximize this information content with re- 
spect to the quantities g r , treated as real numbers, subject to the constraint 
J2 r 9r r = N. Introducing a Lagrange multiplier A for the constraint, the 
derivative is 




— log r! — log g r + \r, 



(10.31) 



0.5 - 




Figure 10.9. Skeleton of the 
mutual information for the 
ternary confusion channel. 
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10.10: Solutions 

which, when set to zero, leads to the rather nice expression 

9r = (10.32) 
r! 

the optimal g r is proportional to a Poisson distribution! We can solve for the 
Lagrange multiplier by plugging g r into the constraint ^ r g r r = N, which 
gives the implicit equation 

Af = /xe' 1 , (10.33) 

where /i = e x is a convenient reparameterization of the Lagrange multiplier. 
Figure 10.10a shows a graph of fJ,(N); figure 10.10b shows the deduced non- 
integer assignments g r when fi = 2.2, and nearby integers g r = {1,2,2,1,1} 
that motivate setting the first partition to (a)(bc)(de)(fgh)(ijk)(lmno)(pqrst). 

This partition produces a random partition at the other end, which has an 
information content of log Q. = 40.4 bits, which is a lot more than half the total 
information content we need to acquire to infer the transatlantic permutation, 
log 20! ~ 61 bits. [In contrast, if all the wires are joined together in pairs, 
the information content generated is only about 29 bits.] How to choose the 
second partition is left to the reader. A Shannonesque approach is appropriate, 
picking a random partition at the other end, using the same {g r }', you need 
to ensure the two partitions are as unlike each other as possible. 

If N 2, 5 or 9, then the labelling problem has solutions that are 
particularly simple to implement, called Knowlton-Graham partitions: par- 
tition {1, ...,N} into disjoint sets in two ways A and B, subject to the 
condition that at most one element appears both in an A set of cardinal- 
ity j and in a B set of cardinality k, for each j and k (Graham, 1966; 
Graham and Knowlton, 1968). 
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1 10 100 1000 



2.5 - 



2 - 
1.5 - 



0.5 - 

(b) o- 




123456789 10 

Figure 10.10. Approximate 
solution of the cable-labelling 
problem using Lagrange 
multipliers, (a) The parameter \i 
as a function of N; the value 
/it (20) = 2.2 is highlighted, (b) 
Non- integer values of the function 
g r = t 1 /r! are shown by lines and 
integer values of g r motivated by 
those non-integer values are 
shown by crosses. 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 



About Chapter 11 



Before reading Chapter 11, you should have read Chapters 9 and 10. 
You will also need to be familiar with the Gaussian distribution. 

One-dimensional Gaussian distribution. If a random variable y is Gaus- 
sian and has mean fj, and variance cr 2 , which we write: 

y ~ Normal(/i, a 2 ), or P(y) = Normal(y; fi, cr 2 ), (H-l) 

then the distribution of y is: 

P(y | M , cr 2 ) = -jL= exp [-(y - M ) 2 /2a 2 ] . (11.2) 

[I use the symbol P for both probability densities and probabilities.] 

The inverse-variance r = V<x 2 is sometimes called the precision of the 
Gaussian distribution. 

Multi-dimensional Gaussian distribution. If y = {y\,V2i ■ ■ ■ iVn) has a 
multivariate Gaussian distribution, then 

P(Y I x, A) = — exp (-i(y - x) T A(y - x)) , (11.3) 

where x is the mean of the distribution, A is the inverse of the 
variance-covariance matrix, and the normalizing constant is Z(A) = 
(det(A/27r)r 1/2 . 

This distribution has the property that the variance En of j/j, and the 
covariance Eij of j/j and yj are given by 

Eij = £ [{ Vl - yi){ Vj - yj)} = A~\ (11.4) 

where A -1 is the inverse of the matrix A. 

The marginal distribution P(yi) of one component y^ is Gaussian; 
the joint marginal distribution of any subset of the components is 
multivariate-Gaussian; and the conditional density of any subset, given 
the values of another subset, for example, P(yt \ yj), is also Gaussian. 
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11 



Error-Correcting Codes & Real Channels 



The noisy-channel coding theorem that we have proved shows that there exist 
reliable error-correcting codes for any noisy channel. In this chapter we address 
two questions. 

First, many practical channels have real, rather than discrete, inputs and 
outputs. What can Shannon tell us about these continuous channels? And 
how should digital signals be mapped into analogue waveforms, and vice versa! 

Second, how are practical error-correcting codes made, and what is 
achieved in practice, relative to the possibilities proved by Shannon? 

► 11.1 The Gaussian channel 

The most popular model of a real-input, real-output channel is the Gaussian 
channel. 

The Gaussian channel has a real input x and a real output y. The condi- 
tional distribution of y given a; is a Gaussian distribution: 

P{y\x) = -jL=^ V [-{y-x) 2 /2a 2 ]. (11.5) 

This channel has a continuous input and output but is discrete in time. 
We will show below that certain continuous-time channels are equivalent 
to the discrete-time Gaussian channel. 

This channel is sometimes called the additive white Gaussian noise 
(AWGN) channel. 

As with discrete channels, we will discuss what rate of error-free information 
communication can be achieved over this channel. 

Motivation in terms of a continuous-time channel 

Consider a physical (electrical, say) channel with inputs and outputs that are 
continuous in time. We put in x(t), and out comes y(t) = x(t) + n(t). 

Our transmission has a power cost. The average power of a transmission 
of length T may be constrained thus: 

/ dt [x(t)] 2 /T < P. (11.6) 
Jo 

The received signal is assumed to differ from x(t) by additive noise n(t) (for 
example Johnson noise), which we will model as white Gaussian noise. The 
magnitude of this noise is quantified by the noise spectral density, Nq. 
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How could such a channel be used to communicate information? Consider 
transmitting a set of N real numbers {x n }^ =1 in a signal of duration T made 
up of a weighted combination of orthonormal basis functions <f> n (t), 



it) 



N 



(11.7) 



where / Q T dt <f> n (t)(f> m (t) = 5. 



Vn 



L 



The receiver can then compute the scalars: 
dt <f> n {t)y{t) = x n + dt <j> n {t)n{t) 



— x r . 



(11.8) 

(11.9) 
The white 



This noise is 



for n = 1 ... N. If there were no noise, then y n would equal 
Gaussian noise n(t) adds scalar noise n n to the estimate y n . 
Gaussian: 

n n ~ Normal(0,AT /2), (11.10) 

where Nq is the spectral density introduced above. Thus a continuous chan- 
nel used in this way is equivalent to the Gaussian channel defined at equa- 
tion (11.5). The power constraint dt[x(t)] 2 < PT defines a constraint on 
the signal amplitudes x n , 

PT 

(11.11) 



<<PT 



N 



Before returning to the Gaussian channel, we define the bandwidth (mea- 
sured in Hertz) of the continuous channel to be: 



w 



2T 



(11.12) 



where 7V max is the maximum number of orthonormal functions that can be 
produced in an interval of length T. This definition can be motivated by 
imagining creating a band-limited signal of duration T from orthonormal co- 
sine and sine curves of maximum frequency W. The number of orthonormal 
functions is N max = 2WT. This definition relates to the Nyquist sampling 
theorem: if the highest frequency present in a signal is W, then the signal 
can be fully determined from its values at a series of discrete sample points 
separated by the Nyquist interval At = Vzw seconds. 

So the use of a real continuous channel with bandwidth W, noise spectral 
density A^o and power P is equivalent to N/T = 2W uses per second of a 
Gaussian channel with noise level a 2 = Nq/2 and subject to the signal power 
constraint x 2 < p hw. 




Mt) 
h(t) 

x(t) 



Figure 11.1. Three basis functions, 
and a weighted combination of 
them, x(t) = X)n=i x n<Pn{t), with 
x\ =0.4, x 2 = — 0.2, and x 3 = 0.1. 



Definition of E^/Nq 

Imagine that the Gaussian channel y n = x n + n n is used with an encoding 
system to transmit binary source bits at a rate of R bits per channel use. How 
can we compare two encoding systems that have different rates of communi- 
cation R and that use different powers x 2 ^. Transmitting at a large rate 7? is 
good; using small power is good too. 

It is conventional to measure the rate-compensated signal-to-noise ratio by 
the ratio of the power per source bit = x 2 /R to the noise spectral density 
N : 



E h /N = 



(11.13) 



2cr 2 R' 

E^/Nq is one of the measures used to compare coding schemes for Gaussian 
channels. 



E^/Nq is dimensionless, but it is 
usually reported in the units of 
decibels; the value given is 
101og 10 £yiVo. 
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11.2 Inferring the input to a real channel 

'The best detection of pulses' 

In 1944 Shannon wrote a memorandum (Shannon, 1993) on the problem of 
best differentiating between two types of pulses of known shape, represented 
by vectors xrj and xi, given that one of them has been transmitted over a 
noisy channel. This is a pattern recognition problem. It is assumed that the 
noise is Gaussian with probability density 



P(n) 



'I 



1/2 



exp 



in T An 

2 



(11.14) 



where A is the inverse of the variance-covariance matrix of the noise, a sym- 
metric and positive-definite matrix. (If A is a multiple of the identity matrix, 
I/cr 2 , then the noise is 'white'. For more general A, the noise is 'coloured'.) 
The probability of the received vector y given that the source signal was s 
(either zero or one) is then 



P(y\*) 



det 



2tt) 



1/2 



exp 



1 



(y - x s ) T A(y - x s ) 



The optimal detector is based on the posterior probability ratio: 

^(* = i|y) P{y\s = i)P{s = i) 



P(y|s = 0) P(a = 0) 



-(y - xi) T A(y - xi) + -(y - x ) T A(y - x ) 



P(s = 0|y) 
= exp ^ 
= exp (y T A(xi - x ) + 6) , 
where is a constant independent of the received vector y, 



In 



(11.15) 
(11.16) 

p(s=iy 



-x^Axi + -xjAx 



In 



P(s = 0)' 



P(s = 0) j 
(11.17) 



(11.18) 



If the detector is forced to make a decision (i.e., guess either s = 1 or s = 0) then 
the decision that minimizes the probability of error is to guess the most prob- 
able hypothesis. We can write the optimal decision in terms of a discriminant 
function: 

o(y) = y T A( Xl - x ) + 9 



with the decisions 



o(y) > 
o(y) < 
o(y) = 



guess s = 1 
guess s = 
guess either. 



Notice that a(y) is a linear function of the received vector, 

«(y) 

where w = A(xi — Xq). 



w T y 



(11.19) 



(11.20) 



(11.21) 



x 



xi 



Figure 11.2. Two pulses x and 
xi, represented as 31-dimensional 
vectors, and a noisy version of one 
of them, y. 



Figure 11.3. The weight vector 
w oc xi — x that is used to 
discriminate between x and xi . 



► 11.3 Capacity of Gaussian channel 

Until now we have only measured the joint, marginal, and conditional entropy 
of discrete variables. In order to define the information conveyed by continuous 
variables, there are two issues we must address - the infinite length of the real 
line, and the infinite precision of real numbers. 
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Infinite inputs 

How much information can we convey in one use of a Gaussian channel? If 
we are allowed to put any real number x into the Gaussian channel, we could 
communicate an enormous string of N digits did^d^ . . . djv by setting x = 
d\d2di . . . dArOOO . . . 000. The amount of error-free information conveyed in 
just a single transmission could be made arbitrarily large by increasing N , 
and the communication could be made arbitrarily reliable by increasing the 
number of zeroes at the end of x. There is usually some power cost associated 
with large inputs, however, not to mention practical limits in the dynamic 
range acceptable to a receiver. It is therefore conventional to introduce a 
cost function v(x) for every input x, and constrain codes to have an average 
cost v less than or equal to some maximum value. A generalized channel 
coding theorem, including a cost function for the inputs, can be proved - see 
McEliece (1977). The result is a channel capacity C(v) that is a function of 
the permitted cost. For the Gaussian channel we will assume a cost 



v(x) = x 2 



(11.22) 



such that the 'average power' x 2 of the input is constrained. We motivated this 
cost function above in the case of real electrical channels in which the physical 
power consumption is indeed quadratic in x. The constraint x 2 = v makes 
it impossible to communicate infinite information in one use of the Gaussian 
channel. 

Infinite precision 

It is tempting to define joint, marginal, and conditional entropies for real 
variables simply by replacing summations by integrals, but this is not a well 
defined operation. As we discretize an interval into smaller and smaller divi- 
sions, the entropy of the discrete distribution diverges (as the logarithm of the 
granularity) (figure 11.4). Also, it is not permissible to take the logarithm of 
a dimensional quantity such as a probability density P(x) (whose dimensions 
are [a;] -1 ). 

There is one information measure, however, that has a well-behaved limit, 
namely the mutual information - and this is the one that really matters, since 
it measures how much information one variable conveys about another. In the 
discrete case, 

/(X;y)= S PMlog ^w 



(11.23) 



Now because the argument of the log is a ratio of two probabilities over the 
same space, it is OK to have P(x,y), P(x) and P(y) be probability densities 
and replace the sum by an integral: 



I(X-Y) = JdxdyP(x,y)log 

= J dx dy P(x)P(y \ x) log 



P(x,y) 

P(x)P(y) 

P(y\x) 
P(y) ■ 



(11.24) 
(11.25) 



We can now ask these questions for the Gaussian channel: (a) what probability 
distribution P(x) maximizes the mutual information (subject to the constraint 
x 2 = v)? and (b) does the maximal mutual information still measure the 
maximum error-free communication rate of this real channel, as it did for the 
discrete channel? 



(a) 




(b) 




Figure 11.4. (a) A probability 
density P(x). Question: can wc 
define the 'entropy' of this 
density? (b) We could evaluate 
the entropies of a sequence of 
probability distributions with 
decreasing grain-size g, but these 
entropies tend to 

/P(x) log — dx, which is not 
P(x)g 

independent of g: the entropy 
goes up by one bit for every 
halving of g. 

j P(x) log p/ ^ dx is an illegal 



integral. 



P(x) 
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11.3: Capacity of Gaussian channel 

Exercise 11.1.^' P- 189 l Prove that the probability distribution P(x) that max- 
imizes the mutual information (subject to the constraint x 2 = v) is a 
Gaussian distribution of mean zero and variance v. 

t> Exercise 11.2.^' P- 189 l Show that the mutual information I(X; Y), in the case 
of this optimized distribution, is 

C=\\og(l+^). (11.26) 

This is an important result. We see that the capacity of the Gaussian channel 
is a function of the signal-to-noise ratio v/a 2 . 



Inferences given a Gaussian input distribution 

If P(x) = Normal(x; 0, v) and P(y\x) = Normal(y; x, a 2 ) then the marginal 
distribution of y is P(y) = Normal(y; 0, v + a 2 ) and the posterior distribution 
of the input, given that the output is y, is: 

P(x\y) oc P(y\x)P(x) (11.27) 
oc exp(-(y-a;) 2 /2cr 2 )exp(-a; 2 /2w) (11.28) 







1 


G- 







Normal [ x; — — y l- + —\ . (11.29) 



[The step from (11.28) to (11.29) is made by completing the square in the 
exponent.] This formula deserves careful study. The mean of the posterior 
distribution, v " a 2 Vi can be viewed as a weighted combination of the value 
that best fits the output, x = y, and the value that best fits the prior, x = 0: 

y= T~, ', 7 V + 1M 0. (11.30) 



v + a 2 * l/v + l/a 2 * 1/v + l/a 2 

The weights 1/cr 2 and 1/v are the precisions of the two Gaussians that we 
multiplied together in equation (11.28): the prior and the likelihood. 

The precision of the posterior distribution is the sum of these two pre- 
cisions. This is a general property: whenever two independent sources con- 
tribute information, via Gaussian distributions, about an unknown variable, 
the precisions add. [This is the dual to the better-known relationship 'when 
independent variables are added, their variances add'.] 



Noisy-channel coding theorem for the Gaussian channel 

We have evaluated a maximal mutual information. Does it correspond to a 
maximum possible rate of error-free information transmission? One way of 
proving that this is so is to define a sequence of discrete channels, all derived 
from the Gaussian channel, with increasing numbers of inputs and outputs, 
and prove that the maximum mutual information of these channels tends to the 
asserted C. The noisy-channel coding theorem for discrete channels applies 
to each of these derived channels, thus we obtain a coding theorem for the 
continuous channel. Alternatively, we can make an intuitive argument for the 
coding theorem specific for the Gaussian channel. 
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Geometrical view of the noisy-channel coding theorem: sphere packing 

Consider a sequence x = (x\, . . . , xn) of inputs, and the corresponding output 
y, as defining two points in an N dimensional space. For large 7V, the noise 
power is very likely to be close (fractionally) to Na 2 . The output y is therefore 
very likely to be close to the surface of a sphere of radius \/Na 2 centred on x. 
Similarly, if the original signal x is generated at random subject to an average 
power constraint x 2 = v, then x is likely to lie close to a sphere, centred 
on the origin, of radius y/Nv; and because the total average power of y is 
v + a 2 , the received signal y is likely to lie on the surface of a sphere of radius 
\J N(v + a 2 ), centred on the origin. 

The volume of an iV-dimensional sphere of radius r is 



V(r, N) 



T JV/2 



r(Jv/2+i) 



(11.31) 



Now consider making a communication system based on non-confusable 
inputs x, that is, inputs whose spheres do not overlap significantly. The max- 
imum number S of non-confusable inputs is given by dividing the volume of 
the sphere of probable ys by the volume of the sphere for y given x: 



S < 



y/N{v + a 2 ) 
x/Na 2 



N 



(11.32) 



Thus the capacity is bounded by: 



C=llogM<ilog(l + ^). 



(11.33) 



A more detailed argument like the one used in the previous chapter can es- 
tablish equality. 



Back to the continuous channel 

Recall that the use of a real continuous channel with bandwidth W, noise 
spectral density No and power P is equivalent to N/T = 2W uses per second of 
a Gaussian channel with a 2 = No/2 and subject to the constraint x 2 ^ < P/2W. 
Substituting the result for the capacity of the Gaussian channel, we find the 
capacity of the continuous channel to be: 



C = W\og 1 



N W J 



bits per second. 



(11.34) 



This formula gives insight into the tradeoffs of practical communication. Imag- 
ine that we have a fixed power constraint. What is the best bandwidth to make 
use of that power? Introducing Wo = P/Nq, i.e., the bandwidth for which the 
signal-to-noise ratio is 1, figure 11.5 shows C /Wo = W/Wolog(l + Wq/W) as 
a function of W/Wo- The capacity increases to an asymptote of Wo log e. It 
is dramatically better (in terms of capacity for fixed power) to transmit at a 
low signal-to-noise ratio over a large bandwidth, than with high signal-to-noise 
in a narrow bandwidth; this is one motivation for wideband communication 
methods such as the 'direct sequence spread-spectrum' approach used in 3G 
mobile phones. Of course, you are not alone, and your electromagnetic neigh- 
bours may not be pleased if you use a large bandwidth, so for social reasons, 
engineers often have to make do with higher-power, narrow-bandwidth trans- 
mitters. 




Figure 11.5. Capacity versus 
bandwidth for a real channel: 
C/Wq = W/W log (1 + W /W) 
as a function of W/Wq. 
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11.4: What are the capabilities of practical error-correcting codes? 

► 11.4 What are the capabilities of practical error-correcting codes? 

Nearly all codes are good, but nearly all codes require exponential look-up 
tables for practical implementation of the encoder and decoder - exponential 
in the blocklength N. And the coding theorem required N to be large. 

By a practical error-correcting code, we mean one that can be encoded 
and decoded in a reasonable amount of time, for example, a time that scales 
as a polynomial function of the blocklength N - preferably linearly. 

The Shannon limit is not achieved in practice 

The non-constructive proof of the noisy-channel coding theorem showed that 
good block codes exist for any noisy channel, and indeed that nearly all block 
codes are good. But writing down an explicit and practical encoder and de- 
coder that are as good as promised by Shannon is still an unsolved problem. 

Very good codes. Given a channel, a family of block codes that achieve 
arbitrarily small probability of error at any communication rate up to 
the capacity of the channel are called 'very good' codes for that channel. 

Good codes are code families that achieve arbitrarily small probability of 
error at non-zero communication rates up to some maximum rate that 
may be less than the capacity of the given channel. 

Bad codes are code families that cannot achieve arbitrarily small probability 
of error, or that can only achieve arbitrarily small probability of error by 
decreasing the information rate to zero. Repetition codes are an example 
of a bad code family. (Bad codes are not necessarily useless for practical 
purposes.) 

Practical codes are code families that can be encoded and decoded in time 
and space polynomial in the blocklength. 

Most established codes are linear codes 

Let us review the definition of a block code, and then add the definition of a 
linear block code. 

An (N,K) block code for a channel Q is a list of S = 2 K codewords 
{xW,x( 2 ),. . . ,yS 2K) }, each of length N: x< s > € Ax- The signal to be 
encoded, s, which comes from an alphabet of size 2 K , is encoded as xM. 

A linear (N,K) block code is a block code in which the codewords {x' 8 '} 
make up a iGdimensional subspace of Ax- The encoding operation can 
be represented by an N x K binary matrix G T such that if the signal to 
be encoded, in binary notation, is s (a vector of length K bits), then the 
encoded signal is t = G T s modulo 2. 

The codewords {t} can be defined as the set of vectors satisfying Ht = 
0mod2, where H is the parity-check matrix of the code. 

For example the (7,4) Hamming code of section 1.2 takes K = 4 signal 
bits, s, and transmits them followed by three parity-check bits. The N = 7 
transmitted symbols are given by G T s mod 2. 

Coding theory was born with the work of Hamming, who invented a fam- 
ily of practical error-correcting codes, each able to correct one error in a 
block of length N, of which the repetition code R3 and the (7, 4) code are 



G T 



1 • • • 

• 1 • • 

• • 1 • 

• • • 1 
111 • 
•111 
1-11 
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the simplest. Since then most established codes have been generalizations of 
Hamming's codes: Bose-Chaudhury-Hocquenhem codes, Reed-Miiller codes, 
Reed-Solomon codes, and Goppa codes, to name a few. 

Convolutional codes 

Another family of linear codes are convolutional codes, which do not divide 
the source stream into blocks, but instead read and transmit bits continuously. 
The transmitted bits are a linear function of the past source bits. Usually the 
rule for generating the transmitted bits involves feeding the present source 
bit into a linear-feedback shift-register of length k, and transmitting one or 
more linear functions of the state of the shift register at each iteration. The 
resulting transmitted bit stream is the convolution of the source stream with 
a linear filter. The impulse-response function of this filter may have finite or 
infinite duration, depending on the choice of feedback shift-register. 
We will discuss convolutional codes in Chapter 48. 

Are linear codes 'good'? 

One might ask, is the reason that the Shannon limit is not achieved in practice 
because linear codes are inherently not as good as random codes? The answer 
is no, the noisy-channel coding theorem can still be proved for linear codes, 
at least for some channels (see Chapter 14), though the proofs, like Shannon's 
proof for random codes, are non-constructive. 

Linear codes are easy to implement at the encoding end. Is decoding a 
linear code also easy? Not necessarily. The general decoding problem (find 
the maximum likelihood s in the equation G T s + n = r) is in fact NP-complete 
(Berlekamp et ai, 1978). [NP-complete problems are computational problems 
that are all equally difficult and which are widely believed to require expo- 
nential computer time to solve in general.] So attention focuses on families of 
codes for which there is a fast decoding algorithm. 

Concatenation 

One trick for building codes with practical decoders is the idea of concatena- 
tion. 

An encoder-channel-decoder system C — > Q — > T> can be viewed as defining 
a super-channel Q' with a smaller probability of error, and with complex 
correlations among its errors. We can create an encoder C and decoder V for 
this super-channel Q' . The code consisting of the outer code C followed by 
the inner code C is known as a concatenated code. 

Some concatenated codes make use of the idea of interleaving. We read 
the data in blocks, the size of each block being larger than the blocklengths 
of the constituent codes C and C. After encoding the data of one block using 
code C, the bits are reordered within the block in such a way that nearby 
bits are separated from each other once the block is fed to the second code 
C. A simple example of an interleaver is a rectangular code or product code 
in which the data are arranged in a K 2 x K\ block, and encoded horizontally 
using an (N\,Ki) linear code, then vertically using a (^2,^2) linear code. 

> Exercise 11.3. ^ Show that either of the two codes can be viewed as the inner 
code or the outer code. 

As an example, figure 11.6 shows a product code in which we encode 
first with the repetition code R3 (also known as the Hamming code H(3, 1)) 



C'->C->Q->V-,V 

Q' 
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Figure 11.6. A product code, (a) 
A string 1011 encoded using a 
concatenated code consisting of 
two Hamming codes, H(3, 1) and 
H(7, 4). (b) a noise pattern that 
flips 5 bits, (c) The received 
vector, (d) After decoding using 
the horizontal (3, 1) decoder, and 
(e) after subsequently using the 
vertical (7, 4) decoder. The 
decoded vector matches the 
original. 

(d', e') After decoding in the other 
order, three errors still remain. 



horizontally then with H (7, 4) vertically. The blocklength of the concatenated 
code is 27. The number of source bits per codeword is four, shown by the 
small rectangle. 

We can decode conveniently (though not optimally) by using the individual 
decoders for each of the subcodes in some sequence. It makes most sense to 
first decode the code which has the lowest rate and hence the greatest error- 
correcting ability. 

Figure 11.6(c-e) shows what happens if we receive the codeword of fig- 
ure 11.6a with some errors (five bits flipped, as shown) and apply the decoder 
for if (3, 1) first, and then the decoder for H(7, 4). The first decoder corrects 
three of the errors, but erroneously modifies the third bit in the second row 
where there are two bit errors. The (7,4) decoder can then correct all three 
of these errors. 

Figure 11.6(d'-e') shows what happens if we decode the two codes in the 
other order. In columns one and two there are two errors, so the (7, 4) decoder 
introduces two extra errors. It corrects the one error in column 3. The (3, 1) 
decoder then cleans up four of the errors, but erroneously infers the second 
bit. 

Interleaving 

The motivation for interleaving is that by spreading out bits that are nearby 
in one code, we make it possible to ignore the complex correlations among the 
errors that are produced by the inner code. Maybe the inner code will mess 
up an entire codeword; but that codeword is spread out one bit at a time over 
several codewords of the outer code. So we can treat the errors introduced by 
the inner code as if they are independent. 

Other channel models 

In addition to the binary symmetric channel and the Gaussian channel, coding 
theorists keep more complex channels in mind also. 

Burst-error channels are important models in practice. Reed-Solomon 
codes use Galois fields (see Appendix C.l) with large numbers of elements 
(e.g. 2 16 ) as their input alphabets, and thereby automatically achieve a degree 
of burst-error tolerance in that even if 17 successive bits are corrupted, only 2 
successive symbols in the Galois field representation are corrupted. Concate- 
nation and interleaving can give further protection against burst errors. The 
concatenated Reed-Solomon codes used on digital compact discs are able to 
correct bursts of errors of length 4000 bits. 
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> Exercise 11.4.^' P- 189 l The technique of interleaving, which allows bursts of 
errors to be treated as independent, is widely used, but is theoretically 
a poor way to protect data against burst errors, in terms of the amount 
of redundancy required. Explain why interleaving is a poor method, 
using the following burst-error channel as an example. Time is divided 
into chunks of length N = 100 clock cycles; during each chunk, there 
is a burst with probability b = 0.2; during a burst, the channel is a bi- 
nary symmetric channel with / = 0.5. If there is no burst, the channel 
is an error-free binary channel. Compute the capacity of this channel 
and compare it with the maximum communication rate that could con- 
ceivably be achieved if one used interleaving and treated the errors as 
independent. 

Fading channels are real channels like Gaussian channels except that the 
received power is assumed to vary with time. A moving mobile phone is an 
important example. The incoming radio signal is reflected off nearby objects 
so that there are interference patterns and the intensity of the signal received 
by the phone varies with its location. The received power can easily vary by 
10 decibels (a factor of ten) as the phone's antenna moves through a distance 
similar to the wavelength of the radio signal (a few centimetres). 

► 11.5 The state of the art 

What are the best known codes for communicating over Gaussian channels? 
All the practical codes are linear codes, and are either based on convolutional 
codes or block codes. 



Convolutional codes, and codes based on them 



Textbook convolutional codes. The 'de facto standard' error-correcting 
code for satellite communications is a convolutional code with constraint 
length 7. Convolutional codes are discussed in Chapter 48. 

Concatenated convolutional codes. The above convolutional code can be 
used as the inner code of a concatenated code whose outer code is a Reed- 
Solomon code with eight-bit symbols. This code was used in deep space 
communication systems such as the Voyager spacecraft. For further 
reading about Reed-Solomon codes, see Lin and Costello (1983). 

The code for Galileo. A code using the same format but using a longer 
constraint length - 15 - for its convolutional code and a larger Reed- 
Solomon code was developed by the Jet Propulsion Laboratory (Swan- 
son, 1988). The details of this code are unpublished outside JPL, and the 
decoding is only possible using a room full of special-purpose hardware. 
In 1992, this was the best code known of rate Vi. 

Turbo codes. In 1993, Berrou, Glavieux and Thitimajshima reported work 
on turbo codes. The encoder of a turbo code is based on the encoders 
of two convolutional codes. The source bits are fed into each encoder, 
the order of the source bits being permuted in a random way, and the 
resulting parity bits from each constituent code are transmitted. 

The decoding algorithm involves iteratively decoding each constituent 
code using its standard decoding algorithm, then using the output of 
the decoder as the input to the other decoder. This decoding algorithm 



Ci 



Figure 11.7. The encoder of a 
turbo code. Each box Ci, C*2, 
contains a convolutional code. 
The source bits are reordered 
using a permutation n before they 
are fed to C2. The transmitted 
codeword is obtained by 
concatenating or interleaving the 
outputs of the two convolutional 
codes. The random permutation 
is chosen when the code is 
designed, and fixed thereafter. 
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is an instance of a message-passing algorithm called the sum-product 
algorithm. 

Turbo codes are discussed in Chapter 48, and message passing in Chap- 
ters 16, 17, 25, and 26. 

Block codes 

Gallager's low-density parity-check codes. The best block codes known 
for Gaussian channels were invented by Gallager in 1962 but were 
promptly forgotten by most of the coding theory community. They were 
rediscovered in 1995 and shown to have outstanding theoretical and prac- 
tical properties. Like turbo codes, they are decoded by message-passing 
algorithms. 

We will discuss these beautifully simple codes in Chapter 47. 

The performances of the above codes are compared for Gaussian channels 
in figure 47.17, p. 568. 

► 11.6 Summary 

Random codes are good, but they require exponential resources to encode 
and decode them. 

Non-random codes tend for the most part not to be as good as random 
codes. For a non-random code, encoding may be easy, but even for 
simply-defined linear codes, the decoding problem remains very difficult. 

The best practical codes (a) employ very large block sizes; (b) are based 
on semi-random code constructions; and (c) make use of probability- 
based decoding algorithms. 
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Figure 11.8. A low-density 
parity-check matrix and the 
corresponding graph of a rate- V4 
low-density parity-check code 
with blocklcngth N = 16, and 
M = 12 constraints. Each white 
circle represents a transmitted bit. 
Each bit participates in j = 3 
constraints, represented by 
squares. Each constraint forces 
the sum of the k = 4 bits to which 
it is connected to be even. This 
code is a (16, 4) code. 
Outstanding performance is 
obtained when the blocklcngth is 
increased to N ~ 10 000. 



► 11.7 Nonlinear codes 

Most practically used codes are linear, but not all. Digital soundtracks are 
encoded onto cinema film as a binary pattern. The likely errors affecting the 
film involve dirt and scratches, which produce large numbers of Is and 0s 
respectively. We want none of the codewords to look like all- Is or all-Os, so 
that it will be easy to detect errors caused by dirt and scratches. One of the 
codes used in digital cinema sound systems is a nonlinear (8, 6) code consisting 
of 64 of the (4) binary patterns of weight 4. 



► 11.8 Errors other than noise 

Another source of uncertainty for the receiver is uncertainty about the tim- 
ing of the transmitted signal x(t). In ordinary coding theory and infor- 
mation theory, the transmitter's time t and the receiver's time u are as- 
sumed to be perfectly synchronized. But if the receiver receives a signal 
y(u), where the receiver's time, u, is an imperfectly known function u(t) 
of the transmitter's time t, then the capacity of this channel for commu- 
nication is reduced. The theory of such channels is incomplete, compared 
with the synchronized channels we have discussed thus far. Not even the ca- 
pacity of channels with synchronization errors is known (Levenshtein, 1966; 
Ferreira et ai, 1997); codes for reliable communication over channels with 
synchronization errors remain an active research area (Davey and MacKay, 
2001). 
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Further reading 

For a review of the history of spread-spectrum methods, see Scholtz (1982). 



► 11.9 Exercises 

The Gaussian channel 

> Exercise 11.5. p - 190 l Consider a Gaussian channel with a real input x, and 

signal to noise ratio vja 2 . 

(a) What is its capacity C? 

(b) If the input is constrained to be binary, x € {i-y/u}, what is the 
capacity C" of this constrained channel? 

(c) If in addition the output of the channel is thresholded using the 
mapping 

what is the capacity C" of the resulting channel? 

(d) Plot the three capacities above as a function of v/a 2 from 0.1 to 2. 
[You'll need to do a numerical integral to evaluate C] 

> Exercise 11.6. ^ For large integers K and TV", what fraction of all binary error- 

correcting codes of length N and rate R = K/N are linear codes? [The 
answer will depend on whether you choose to define the code to be an 
ordered list of 2 K codewords, that is, a mapping from s e {1,2,... ,2 K } 
to x( s \ or to define the code to be an unordered list, so that two codes 
consisting of the same codewords are identical. Use the latter definition: 
a code is a set of codewords; how the encoder operates is not part of the 
definition of the code.] 



Erasure channels 

> Exercise 11.7. ^ Design a code for the binary erasure channel, and a decoding 
algorithm, and evaluate their probability of error. [The design of good 
codes for erasure channels is an active research area (Spielman, 1996; 
Byers et at, 1998); see also Chapter 50.] 

t> bxercise 11.8. [5] Desi gn a code for the g-ary erasure channel, whose input x is 
drawn from 0, 1, 2, 3, . . . , (q — 1), and whose output y is equal to x with 
probability (1 — /) and equal to ? otherwise. [This erasure channel is a 
good model for packets transmitted over the internet, which are either 
received reliably or are lost.] 

bxercise 11.9. [5 ' p - 1901 How do redundant arrays of independent disks (RAID) 
work? These are information storage systems consisting of about ten 
disk drives, of which any two or three can be disabled and the others are 
able to still able to reconstruct any requested file. What codes are used, 
and how far are these systems from the Shannon limit for the problem 
they are solving? How would you design a better RAID system? Some 
information is provided in the solution section. See http://www.acnc. 
com/raid2.html; see also Chapter 50. 



[Some people say RAID stands for 
'redundant array of inexpensive 
disks', but I think that's silly - 
RAID would still be a good idea 
even if the disks were expensive!] 
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► 11.10 Solutions 

Solution to exercise 11.1 (p. 181). Introduce a Lagrange multiplier A for the 
power constraint and another, /i, for the constraint of normalization of P{x). 

F = I(X;Y) - XfdxP(x)x 2 - fj, J dx P(x) (11.36) 



= JdxP{x) JdyP(y\x)ln 



\x — fl 



(11.37) 



/ 



P(y\x) 
P(y) 

Make the functional derivative with respect to P(x*). 

-h^h^m^-y (1L38 » 

The final factor SP(y)/5P(x*) is found, using P(y) = Jdx P(x)P(y \x), to be 
P(y\x*), and the whole of the last term collapses in a puff of smoke to 1, 
which can be absorbed into the /i term. 

Substitute P(y \ x) = exp( — (y — x) 2 /2a 2 )/\ / 2na 2 and set the derivative to 
zero: 

JdyP(y\x)\n^^--\x 2 -p' = (11.39) 

dl/ ^^^g 7 ^^ In [P(VM = -X*> - // - \ (11-40) 

This condition must be satisfied by ln[P(y)a] for all x. 

Writing a Taylor expansion of ln[_P(y)cr] = a+by+cy 2 +- ■ •, only a quadratic 
function ln[P(y)cr] = a+cy 2 would satisfy the constraint (11.40). (Any higher 
order terms y p , p > 2, would produce terms in x p that are not present on 
the right-hand side.) Therefore P(y) is Gaussian. We can obtain this optimal 
output distribution by using a Gaussian input distribution P(x). 

Solution to exercise 11.2 (p. 181). Given a Gaussian input distribution of vari- 
ance v, the output distribution is Normal (0, v + a 2 ), since x and the noise 
are independent random variables, and variances add for independent random 
variables. The mutual information is: 

I(X;Y) = J dx dy P(x)P(y | x) log P(y | x) - J dy P(y) log P(y) (11.41) 

= \^V 2 -\ Xo ^ 2 ^ 

Solution to exercise 11.4 (p. 186). The capacity of the channel is one minus 
the information content of the noise that it adds. That information content is, 
per chunk, the entropy of the selection of whether the chunk is bursty, H2(b), 
plus, with probability b, the entropy of the flipped bits, N, which adds up 
to H2(b) + Nb per chunk (roughly; accurate if N is large). So, per bit, the 
capacity is, for N = 100, 

^H 2 (b) + b^j = 1 -0.207 = 0.793. (11.44) 

In contrast, interleaving, which treats bursts of errors as independent, causes 
the channel to be treated as a binary symmetric channel with / = 0.2 x 0.5 = 
0.1, whose capacity is about 0.53. 
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Interleaving throws away the useful information about the correlated- 
ness of the errors. Theoretically, we should be able to communicate about 
(0.79/0.53) ~ 1.6 times faster using a code and decoder that explicitly treat 
bursts as bursts. 



Solution to exercise 11.5 (p. 188). 



(a) Putting together the results of exercises 11.1 and 11.2, we deduce that 
a Gaussian channel with real input x, and signal to noise ratio v/a 2 has 
capacity 



C 



1 



;log 



(11.45) 



(b) If the input is constrained to be binary, x € {±^/v}, the capacity is 
achieved by using these two inputs with equal probability. The capacity 
is reduced to a somewhat messy integral, 



C" 



&yN(y;Q)\og 



f° 

N(y;0)- / 

J — C 



dyP(y) log P(y), (11.46) 



where N(y;x) = (l/\/27r) exp[(y — x) 2 /2], x = \fvja, and P(y) = 
[N(y; x) + N(y; — x)]/2. This capacity is smaller than the unconstrained 
capacity (11.45), but for small signal-to-noise ratio, the two capacities 
are close in value. 

(c) If the output is thresholded, then the Gaussian channel is turned into 
a binary symmetric channel whose transition probability is given by the 
error function $ defined on page 156. The capacity is 




Figure 11.9. Capacities (from top 
to bottom in each graph) C, C", 
and C" , versus the signal-to-noise 
ratio (y/v/a). The lower graph is 
a log-log plot. 



C" = 1 - H 2 (f), where / = $0/a). 



(11.47) 



Solution to exercise 11.9 (p. 188). There are several RAID systems. One of 
the easiest to understand consists of 7 disk drives which store data at rate 
4/7 using a (7,4) Hamming code: each successive four bits are encoded with 
the code and the seven codeword bits are written one to each disk. Two or 
perhaps three disk drives can go down and the others can recover the data. 
The effective channel model here is a binary erasure channel, because it is 
assumed that we can tell when a disk is dead. 

It is not possible to recover the data for some choices of the three dead 
disk drives; can you see why? 

t> Exercise 11.10. ^' p ' 190 l Give an example of three disk drives that, if lost, lead 
to failure of the above RAID system, and three that can be lost without 
failure. 

Solution to exercise 11.10 (p. 190). The (7,4) Hamming code has codewords 
of weight 3. If any set of three disk drives corresponding to one of those code- 
words is lost, then the other four disks can only recover 3 bits of information 
about the four source bits; a fourth bit is lost. [cf. exercise 13.13 (p. 220) with 
q = 2: there are no binary MDS codes. This deficit is discussed further in 
section 13.11.] 

Any other set of three disk drives can be lost without problems because 
the corresponding four by four submatrix of the generator matrix is invertible. 
A better code would be the digital fountain - see Chapter 50. 
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Part III 

Further Topics in Information Theory 
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About Chapter 12 



In Chapters 1—11, we concentrated on two aspects of information theory and 
coding theory: source coding - the compression of information so as to make 
efficient use of data transmission and storage channels; and channel coding - 
the redundant encoding of information so as to be able to detect and correct 
communication errors. 

In both these areas we started by ignoring practical considerations, concen- 
trating on the question of the theoretical limitations and possibilities of coding. 
We then discussed practical source-coding and channel-coding schemes, shift- 
ing the emphasis towards computational feasibility. But the prime criterion 
for comparing encoding schemes remained the efficiency of the code in terms 
of the channel resources it required: the best source codes were those that 
achieved the greatest compression; the best channel codes were those that 
communicated at the highest rate with a given probability of error. 

In this chapter we now shift our viewpoint a little, thinking of ease of 
information retrieval as a primary goal. It turns out that the random codes 
which were theoretically useful in our study of channel coding are also useful 
for rapid information retrieval. 

Efficient information retrieval is one of the problems that brains seem to 
solve effortlessly, and content-addressable memory is one of the topics we will 
study when we look at neural networks. 
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12 



Hash Codes: Codes for Efficient 
Information Retrieval 



► 12.1 The information-retrieval problem 



A simple example of an information-retrieval problem is the task of imple- 
menting a phone directory service, which, in response to a person's name, 
returns (a) a confirmation that that person is listed in the directory; and (b) 
the person's phone number and other details. We could formalize this prob- 
lem as follows, with S being the number of names that must be stored in the 
directory. 

You are given a list of S binary strings of length N bits, {x' 1 ), . . . ,x^ s '}, 
where S is considerably smaller than the total number of possible strings, 2 N . 
We will call the superscript 's' in x' s ' the record number of the string. The 
idea is that s runs over customers in the order in which they are added to the 
directory and X M is the name of customer s. We assume for simplicity that 
all people have names of the same length. The name length might be, say, 
N = 200 bits, and we might want to store the details of ten million customers, 
so S ~ 10 7 ~ 2 23 . We will ignore the possibility that two customers have 
identical names. 

The task is to construct the inverse of the mapping from s to xw, i.e., to 
make a system that, given a string x, returns the value of s such that x = x( s ) 
if one exists, and otherwise reports that no such s exists. (Once we have the 
record number, we can go and look in memory location s in a separate memory 
full of phone numbers to find the required number.) The aim, when solving 
this task, is to use minimal computational resources in terms of the amount 
of memory used to store the inverse mapping from x to s and the amount of 
time to compute the inverse mapping. And, preferably, the inverse mapping 
should be implemented in such a way that further new strings can be added 
to the directory in a small amount of computer time too. 



string length 
number of strings 
number of possible 
strings 



N ~ 200 
S ~ 2 23 
2-iv ^ 2 200 



Figure 12.1. Cast of characters. 



Some standard solutions 

The simplest and dumbest solutions to the information-retrieval problem are 
a look-up table and a raw list. 

The look-up table is a piece of memory of size 2 N log 2 S, log 2 S being the 
amount of memory required to store an integer between 1 and S. In 
each of the 2 W locations, we put a zero, except for the locations x that 
correspond to strings x.( s \ into which we write the value of s. 

The look-up table is a simple and quick solution, but only if there is 
sufficient memory for the table, and if the cost of looking up entries in 
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memory is independent of the memory size. But in our definition of the 
task, we assumed that N is about 200 bits or more, so the amount of 
memory required would be of size 2 200 ; this solution is completely out 
of the question. Bear in mind that the number of particles in the solar 
system is only about 2 190 . 

The raw list is a simple list of ordered pairs (s 7 x.^) ordered by the value 
of s. The mapping from x to s is achieved by searching through the list 
of strings, starting from the top, and comparing the incoming string x 
with each record x.^ until a match is found. This system is very easy 
to maintain, and uses a small amount of memory, about SN bits, but 
is rather slow to use, since on average five million pairwise comparisons 
will be made. 

> Exercise 12.l\ 2 ' P- 202 l Show that the average time taken to find the required 
string in a raw list, assuming that the original names were chosen at 
random, is about S + N binary comparisons. (Note that you don't 
have to compare the whole string of length N, since a comparison can 
be terminated as soon as a mismatch occurs; show that you need on 
average two binary comparisons per incorrect string match.) Compare 
this with the worst-case search time - assuming that the devil chooses 
the set of strings and the search key. 

The standard way in which phone directories are made improves on the look-up 
table and the raw list by using an alphabetically-ordered list. 

Alphabetical list. The strings {x^ s '} are sorted into alphabetical order. 
Searching for an entry now usually takes less time than was needed 
for the raw list because we can take advantage of the sortedness; for 
example, we can open the phonebook at its middle page, and compare 
the name we find there with the target string; if the target is 'greater' 
than the middle string then we know that the required string, if it exists, 
will be found in the second half of the alphabetical directory. Otherwise, 
we look in the first half. By iterating this splitting-in-the-middle proce- 
dure, we can identify the target string, or establish that the string is not 
listed, in |~log 2 S~\ string comparisons. The expected number of binary 
comparisons per string comparison will tend to increase as the search 
progresses, but the total number of binary comparisons required will be 
no greater than |~log 2 S]N. 

The amount of memory required is the same as that required for the raw 
list. 

Adding new strings to the database requires that we insert them in the 
correct location in the list. To find that location takes about [log 2 S~\ 
binary comparisons. 

Can we improve on the well-established alphabetized list? Let us consider 
our task from some new viewpoints. 

The task is to construct a mapping x — > s from N bits to log 2 S bits. This 
is a pseudo-invertible mapping, since for any x that maps to a non-zero s, the 
customer database contains the pair (s,xM) that takes us back. Where have 
we come across the idea of mapping from N bits to M bits before? 

We encountered this idea twice: first, in source coding, we studied block 
codes which were mappings from strings of N symbols to a selection of one 
label in a list. The task of information retrieval is similar to the task (which 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 

12.2: Hash codes 195 



we never actually solved) of making an encoder for a typical-set compression 
code. 

The second time that we mapped bit strings to bit strings of another 
dimensionality was when we studied channel codes. There, we considered 
codes that mapped from K bits to N bits, with N greater than K, and we 
made theoretical progress using random codes. 

In hash codes, we put together these two notions. We will study random 
codes that map from N bits to M bits where M is smaller than N. 

The idea is that we will map the original high-dimensional space down into 
a lower-dimensional space, one in which it is feasible to implement the dumb 
look-up table method which we rejected a moment ago. 



string length 


N ~ 


200 


number of strings 


S ~ 


2 23 


size of hash function 


M- 


; 30 bits 


size of hash table 


T = 


2 M 






2 30 



Figure 12.2. 

characters. 



Revised cast of 



► 12.2 Hash codes 

First we will describe how a hash code works, then we will study the properties 
of idealized hash codes. A hash code implements a solution to the information- 
retrieval problem, that is, a mapping from x to s, with the help of a pseudo- 
random function called a hash function, which maps the iV-bit string x to an 
M-bit string h(x), where M is smaller than N. M is typically chosen such that 
the 'table size' T ~ 2 M is a little bigger than S - say, ten times bigger. For 
example, if we were expecting S to be about a million, we might map x into 
a 30-bit hash h (regardless of the size N of each item x). The hash function 
is some fixed deterministic function which should ideally be indistinguishable 
from a fixed random code. For practical purposes, the hash function must be 
quick to compute. 

Two simple examples of hash functions are: 

Division method. The table size T is a prime number, preferably one that 
is not close to a power of 2. The hash value is the remainder when the 
integer x is divided by T. 

Variable string addition method. This method assumes that x is a string 
of bytes and that the table size T is 256. The characters of x are added, 
modulo 256. This hash function has the defect that it maps strings that 
are anagrams of each other onto the same hash. 

It may be improved by putting the running total through a fixed pseu- 
dorandom permutation after each character is added. In the variable 
string exclusive-or method with table size < 65 536, the string is hashed 
twice in this way, with the initial running total being set to and 1 
respectively (algorithm 12.3). The result is a 16-bit hash. 

Having picked a hash function h(x), we implement an information retriever 
as follows. (See figure 12.4.) 

Encoding. A piece of memory called the hash table is created of size 2 M b 
memory units, where b is the amount of memory needed to represent an 
integer between and S. This table is initially set to zero throughout. 
Each memory x^ s ' is put through the hash function, and at the location 
in the hash table corresponding to the resulting vector h( s ) = h(x( s '), 
the integer s is written - unless that entry in the hash table is already 
occupied, in which case we have a collision between x^ s ' and some earlier 
x( s ') which both happen to have the same hash code. Collisions can be 
handled in various ways - we will discuss some in a moment - but first 
let us complete the basic picture. 
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unsigned char Rand8 L256J ; 


/ / 


• 

This array contains a random 






permutation from 0. .255 to 0. .255 


int Hash (char *x) { 


// 


x is a pointer to the first char; 


int h; 


// 


*x is the first character 


unsigned char hi, h2; 






if Ox == 0) return 0; 


// 


Special handling of empty string 


hi = *x; h2 = *x + 1; 


// 


Initialize two hashes 


x++; 


// 


Proceed to the next character 


while Ox) { 






hi = Rand8 [hi * *x] ; 


// 


Exclusive-or with the two hashes 


h2 = Rand8 [h2 " *x] ; 


// 


and put through the randomizer 


x++; 






} 


// 


End of string is reached when *x=0 


h = ((int) (hl)«8) I 


// 


Shift hi left 8 bits and add h2 


(int) h2 ; 






return h ; 

} 


// 


Hash is concatenation of hi and h2 



Algorithm 12.3. C code 
implementing the variable string 
exclusive-or method to create a 
hash h in the range ... 65 535 
from a string x. Author: Thomas 
Niemann. 



Strings 



N bits 



Hash 

function Hash table 

*" hashes 



M bits 



h(x< 2 >) 



Figure 12.4. Use of hash functions 
for information retrieval. For each 
string x' s ', the hash h = h(x' s ') 
is computed, and the value of s is 
written into the hth row of the 
hash table. Blank rows in the 
hash table contain the value zero. 
The table size is T = 2 M . 
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12.3: Collision resolution 

Decoding. To retrieve a piece of information corresponding to a target vector 
x, we compute the hash h of x and look at the corresponding location 
in the hash table. If there is a zero, then we know immediately that the 
string x is not in the database. The cost of this answer is the cost of one 
hash-function evaluation and one look-up in the table of size 2 M . If, on 
the other hand, there is a non-zero entry s in the table, there are two 
possibilities: either the vector x is indeed equal to x^ s '; or the vector x.^ 
is another vector that happens to have the same hash code as the target 
x. (A third possibility is that this non-zero entry might have something 
to do with our yet- to-be-discussed collision-resolution system.) 

To check whether x is indeed equal to x^, we take the tentative answer 
s, look up x( s ) in the original forward database, and compare it bit by 
bit with x; if it matches then we report s as the desired answer. This 
successful retrieval has an overall cost of one hash-function evaluation, 
one look-up in the table of size 2 , another look-up in a table of size 
5, and N binary comparisons - which may be much cheaper than the 
simple solutions presented in section 12.1. 

t> Exercise 12.2.^' P- 202 l If we have checked the first few bits of x( s ) with x and 
found them to be equal, what is the probability that the correct entry 
has been retrieved, if the alternative hypothesis is that x is actually not 
in the database? Assume that the original source strings are random, 
and the hash function is a random hash function. How many binary 
evaluations are needed to be sure with odds of a billion to one that the 
correct entry has been retrieved? 

The hashing method of information retrieval can be used for strings x of 
arbitrary length, if the hash function h(x) can be applied to strings of any 
length. 

► 12.3 Collision resolution 

We will study two ways of resolving collisions: appending in the table, and 
storing elsewhere. 

Appending in table 

When encoding, if a collision occurs, we continue down the hash table and 
write the value of s into the next available location in memory that currently 
contains a zero. If we reach the bottom of the table before encountering a 
zero, we continue from the top. 

When decoding, if we compute the hash code for x and find that the s 
contained in the table doesn't point to an that matches the cue x, we 
continue down the hash table until we either find an s whose x( s ) does match 
the cue x, in which case we are done, or else encounter a zero, in which case 
we know that the cue x is not in the database. 

For this method, it is essential that the table be substantially bigger in size 
than S. If 2 M < S then the encoding rule will become stuck with nowhere to 
put the last strings. 

Storing elsewhere 

A more robust and flexible method is to use pointers to additional pieces of 
memory in which collided strings are stored. There are many ways of doing 
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this. As an example, we could store in location h in the hash table a pointer 
(which must be distinguishable from a valid record number s) to a 'bucket' 
where all the strings that have hash code h are stored in a sorted list. The 
encoder sorts the strings in each bucket alphabetically as the hash table and 
buckets are created. 

The decoder simply has to go and look in the relevant bucket and then 
check the short list of strings that are there by a brief alphabetical search. 

This method of storing the strings in buckets allows the option of making 
the hash table quite small, which may have practical benefits. We may make it 
so small that almost all strings are involved in collisions, so all buckets contain 
a small number of strings. It only takes a small number of binary comparisons 
to identify which of the strings in the bucket matches the cue x. 



► 12.4 Planning for collisions: a birthday problem 

4! yj Exercise 12.3 J 2 ' p ' 202 ' If we wish to store 5 entries using a hash function whose 
output has M bits, how many collisions should we expect to happen, 
assuming that our hash function is an ideal random function? What 
size M of hash table is needed if we would like the expected number of 
collisions to be smaller than 1? 

What size M of hash table is needed if we would like the expected number 
of collisions to be a small fraction, say 1%, of 5? 

[Notice the similarity of this problem to exercise 9.20 (p. 156).] 



► 12.5 Other roles for hash codes 

Checking arithmetic 

If you wish to check an addition that was done by hand, you may find useful 
the method of casting out nines. In casting out nines, one finds the sum, 
modulo nine, of all the digits of the numbers to be summed and compares 
it with the sum, modulo nine, of the digits of the putative answer. [With a 
little practice, these sums can be computed much more rapidly than the full 
original addition.] 

Example 12.4. In the calculation shown in the margin the sum, modulo nine, of 189 
the digits in 189+1254+238 is 7, and the sum, modulo nine, of 1+6+8+1 +1254 
is 7. The calculation thus passes the casting-out-nines test. + 238 

1681 

Casting out nines gives a simple example of a hash function. For any 
addition expression of the form a + b + c + ■ ■ ■, where a, 6, c, . . . are decimal 
numbers we define he {0, 1, 2, 3, 4, 5, 6, 7, 8} by 

h(a + b + c + ■ ■ •) = sum modulo nine of all digits in a,b,c ; (12-1) 

then it is nice property of decimal arithmetic that if 

a + 6+cH = m+n + o-\ (12.2) 

then the hashes h(a + b + c + ■ ■ •) and h(m + n + o + ■■■) are equal. 

> Exercise 12. 5 J 1 ' p ' 203 l What evidence does a correct casting-out-nines match 
give in favour of the hypothesis that the addition has been done cor- 
rectly? 
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12.5: Other roles for hash codes 
Error detection among friends 

Are two files the same? If the files are on the same computer, we could just 
compare them bit by bit. But if the two files are on separate machines, it 
would be nice to have a way of confirming that two files are identical without 
having to transfer one of the files from A to B. [And even if we did transfer one 
of the files, we would still like a way to confirm whether it has been received 
without modifications!] 

This problem can be solved using hash codes. Let Alice and Bob be the 
holders of the two files; Alice sent the file to Bob, and they wish to confirm 
it has been received without error. If Alice computes the hash of her file and 
sends it to Bob, and Bob computes the hash of his file, using the same M-bit 
hash function, and the two hashes match, then Bob can deduce that the two 
files are almost surely the same. 

Example 12.6. What is the probability of a false negative, i.e., the probability, 
given that the two files do differ, that the two hashes are nevertheless 
identical? 

If we assume that the hash function is random and that the process that causes 
the files to differ knows nothing about the hash function, then the probability 
of a false negative is 2~ M . □ 
A 32-bit hash gives a probability of false negative of about 10~ 10 . It is 
common practice to use a linear hash function called a 32-bit cyclic redundancy 
check to detect errors in files. (A cyclic redundancy check is a set of 32 parity- 
check bits similar to the 3 parity-check bits of the (7,4) Hamming code.) 



To have a false-negative rate smaller than one in a billion, M = 32 
bits is plenty, if the errors are produced by noise. 



> Exercise 12.7.^ ' p ' ' Such a simple parity-check code only detects errors; it 
doesn't help correct them. Since error- correcting codes exist, why not 
use one of them to get some error-correcting capability too? 

Tamper detection 

What if the differences between the two files are not simply 'noise', but are 
introduced by an adversary, a clever forger called Fiona, who modifies the 
original file to make a forgery that purports to be Alice's file? How can Alice 
make a digital signature for the file so that Bob can confirm that no-one has 
tampered with the file? And how can we prevent Fiona from listening in on 
Alice's signature and attaching it to other files? 

Let's assume that Alice computes a hash function for the file and sends it 
securely to Bob. If Alice computes a simple hash function for the file like the 
linear cyclic redundancy check, and Fiona knows that this is the method of 
verifying the file's integrity, Fiona can make her chosen modifications to the 
file and then easily identify (by linear algebra) a further 32-or-so single bits 
that, when flipped, restore the hash function of the file to its original value. 
Linear hash functions give no security against forgers. 

We must therefore require that the hash function be hard to invert so that 
no-one can construct a tampering that leaves the hash function unaffected. 
We would still like the hash function to be easy to compute, however, so that 
Bob doesn't have to do hours of work to verify every file he received. Such 
a hash function - easy to compute, but hard to invert - is called a one-way 
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hash function. Finding such functions is one of the active research areas of 
cryptography. 

A hash function that is widely used in the free software community to 
confirm that two files do not differ is MD5, which produces a 128-bit hash. The 
details of how it works are quite complicated, involving convoluted exclusive- 
or-ing and if-ing and and-ing. 1 

Even with a good one-way hash function, the digital signatures described 
above are still vulnerable to attack, if Fiona has access to the hash function. 
Fiona could take the tampered file and hunt for a further tiny modification to 
it such that its hash matches the original hash of Alice's file. This would take 
some time - on average, about 2 32 attempts, if the hash function has 32 bits - 
but eventually Fiona would find a tampered file that matches the given hash. 
To be secure against forgery, digital signatures must either have enough bits 
for such a random search to take too long, or the hash function itself must be 
kept secret. 



Fiona has to hash 2 M files to cheat. 2 32 file modifications is not 
very many, so a 32-bit hash function is not large enough for forgery 
prevention. 



Another person who might have a motivation for forgery is Alice herself. 
For example, she might be making a bet on the outcome of a race, without 
wishing to broadcast her prediction publicly; a method for placing bets would 
be for her to send to Bob the bookie the hash of her bet. Later on, she could 
send Bob the details of her bet. Everyone can confirm that her bet is consis- 
tent with the previously publicized hash. [This method of secret publication 
was used by Isaac Newton and Robert Hooke when they wished to establish 
priority for scientific ideas without revealing them. Hooke's hash function 
was alphabetization as illustrated by the conversion of UT TENSIO, SIC VIS 
into the anagram CEIIINOSSSTTUV.] Such a protocol relies on the assumption 
that Alice cannot change her bet after the event without the hash coming 
out wrong. How big a hash function do we need to use to ensure that Alice 
cannot cheat? The answer is different from the size of the hash we needed in 
order to defeat Fiona above, because Alice is the author of both files. Alice 
could cheat by searching for two files that have identical hashes to each other. 
For example, if she'd like to cheat by placing two bets for the price of one, 
she could make a large number Ni of versions of bet one (differing from each 
other in minor details only), and a large number N 2 of versions of bet two, and 
hash them all. If there's a collision between the hashes of two bets of different 
types, then she can submit the common hash and thus buy herself the option 
of placing either bet. 

Example 12.8. If the hash has M bits, how big do N-y and N 2 need to be for 
Alice to have a good chance of finding two different bets with the same 
hash? 

This is a birthday problem like exercise 9.20 (p. 156) . If there are Ni Montagues 
and N 2 Capulets at a party, and each is assigned a 'birthday' of M bits, the 
expected number of collisions between a Montague and a Capulet is 

N 1 N 2 2- M , (12.3) 



x http : //www . f reesof t . org/CIE/RFC/1321/3 . htm 
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so to minimize the number of files hashed, N\ + N2, Alice should make Ni 
and N2 equal, and will need to hash about 2 M / 2 files until she finds two that 
match. □ 



Alice has to hash 2 M / 2 files to cheat. [This is the square root of the 
number of hashes Fiona had to make.] 



If Alice has the use of C = 10 6 computers for T = 10 years, each computer 
taking t = 1 ns to evaluate a hash, the bet-communication system is secure 
against Alice's dishonesty only if M 3> 21og 2 CT/t ~ 160 bits. 



Further reading 

The Bible for hash codes is volume 3 of Knuth (1968). I highly recommend the 
story of Doug Mcllroy's spell program, as told in section 13.8 of Programming 
Pearls (Bentley, 2000). This astonishing piece of software makes use of a 64- 
kilobyte data structure to store the spellings of all the words of 75 000- word 
dictionary. 



► 12.6 Further exercises 

Exercise 12.9.^ What is the shortest the address on a typical international 
letter could be, if it is to get to a unique human recipient? (Assume 
the permitted characters are [A-Z.0-9].) How long are typical email 
addresses? 

<^M% Exercise 12.10. ^' P' 2 ^ 3 ] How long does a piece of text need to be for you to be 
pretty sure that no human has written that string of characters before? 
How many notes are there in a new melody that has not been composed 
before? 

\> Exercise 12.11. ^' p - 204 l Pattern recognition by molecules. 

Some proteins produced in a cell have a regulatory role. A regulatory 
protein controls the transcription of specific genes in the genome. This 
control often involves the protein's binding to a particular DNA sequence 
in the vicinity of the regulated gene. The presence of the bound protein 
either promotes or inhibits transcription of the gene. 

(a) Use information-theoretic arguments to obtain a lower bound on 
the size of a typical protein that acts as a regulator specific to one 
gene in the whole human genome. Assume that the genome is a 
sequence of 3 x 10 9 nucleotides drawn from a four letter alphabet 
{A, C,G, T}; a protein is a sequence of amino acids drawn from a 
twenty letter alphabet. [Hint: establish how long the recognized 
DNA sequence has to be in order for that sequence to be unique 
to the vicinity of one gene, treating the rest of the genome as a 
random sequence. Then discuss how big the protein must be to 
recognize a sequence of that length uniquely] 

(b) Some of the sequences recognized by DNA-binding regulatory pro- 
teins consist of a subsequence that is repeated twice or more, for 
example the sequence 



GCCCCCCACCCCTGCCCCC 



(12.4) 
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is a binding site found upstream of the alpha-actin gene in humans. 
Does the fact that some binding sites consist of a repeated subse- 
quence influence your answer to part (a)? 

► 12.7 Solutions 

Solution to exercise 12.1 (p. 194). First imagine comparing the string x with 
another random string x' s ' . The probability that the first bits of the two 
strings match is 1/2. The probability that the second bits match is 1/2. As- 
suming we stop comparing once we hit the first mismatch, the expected number 
of matches is 1, so the expected number of comparisons is 2 (exercise 2.34, 
p.38). 

Assuming the correct string is located at random in the raw list, we will 
have to compare with an average of 5/2 strings before we find it, which costs 
25/2 binary comparisons; and comparing the correct strings takes N binary 
comparisons, giving a total expectation of 5 + N binary comparisons, if the 
strings are chosen at random. 

In the worst case (which may indeed happen in practice), the other strings 
are very similar to the search key, so that a lengthy sequence of comparisons 
is needed to find each mismatch. The worst case is when the correct string 
is last in the list, and all the other strings differ in the last bit only, giving a 
requirement of SN binary comparisons. 

Solution to exercise 12.2 (p. 197). The likelihood ratio for the two hypotheses, 
Ho- x' s ) = x, and Hi: x' s ' ^ x, contributed by the datum 'the first bits of 
x( s ) and x are equal' is 

P(Datum|^ ) _ 1 _ o 

P(Datum|Wi) 1/2 1 ' 

If the first r bits all match, the likelihood ratio is 2 r to one. On finding that 
30 bits match, the odds are a billion to one in favour of Ho, assuming we start 
from even odds. [For a complete answer, we should compute the evidence 
given by the prior information that the hash entry s has been found in the 
table at h(x). This fact gives further evidence in favour of Ho-} 

Solution to exercise 12.3 (p. 198). Let the hash function have an output al- 
phabet of size T = 2 M . If M were equal to log 2 5 then we would have exactly 
enough bits for each entry to have its own unique hash. The probability that 
one particular pair of entries collide under a random hash function is 1 /T. The 
number of pairs is 5(5 — l)/2. So the expected number of collisions between 
pairs is exactly 

5(5-l)/(2T). (12.6) 
If we would like this to be smaller than 1, then we need T > 5(5 — l)/2 so 

M>21og 2 5. (12.7) 

We need twice as many bits as the number of bits, log 2 5, that would be 
sufficient to give each entry a unique name. 

If we are happy to have occasional collisions, involving a fraction / of the 
names 5, then we need T > S/f (since the probability that one particular 
name is collided- with is / ~ S/T) so 



M>log 2 5 + log 2 [l//], 



(12.8) 
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12.7: Solutions 

which means for / ~ 0.01 that we need an extra 7 bits above log 2 S. 

The important point to note is the scaling of T with S in the two cases 
(12.7, 12.8). If we want the hash function to be collision-free, then we must 
have T greater than ~ S 2 . If we are happy to have a small frequency of 
collisions, then T needs to be of order S only. 

Solution to exercise 12.5 (p. 198). The posterior probability ratio for the two 
hypotheses, TL + = 'calculation correct' and H- = 'calculation incorrect' is 
the product of the prior probability ratio P(H + )/P(H ) and the likelihood 
ratio, P(match | 7^ + )/P(match | 7i ). This second factor is the answer to the 
question. The numerator P(match|7i + ) is equal to 1. The denominator's 
value depends on our model of errors. If we know that the human calculator is 
prone to errors involving multiplication of the answer by 10, or to transposition 
of adjacent digits, neither of which affects the hash value, then P(match \H~) 
could be equal to 1 also, so that the correct match gives no evidence in favour 
of TL+. But if we assume that errors are 'random from the point of view of the 
hash function' then the probability of a false positive is P(match | TL-) = 1/9, 
and the correct match gives evidence 9:1 in favour of 7i + . 

Solution to exercise 12.7 (p. 199). If you add a tiny M = 32 extra bits of hash 
to a huge iV-bit file you get pretty good error detection - the probability that 
an error is undetected is 2~ , less than one in a billion. To do error correction 
requires far more check bits, the number depending on the expected types of 
corruption, and on the file size. For example, if just eight random bits in a 
megabyte file are corrupted, it would take about log 2 ( 2 g ) — 23 x 8 ~ 180 
bits to specify which are the corrupted bits, and the number of parity-check 
bits used by a successful error-correcting code would have to be at least this 
number, by the counting argument of exercise 1.10 (solution, p. 20). 

Solution to exercise 12.10 (p. 201). We want to know the length L of a string 
such that it is very improbable that that string matches any part of the entire 
writings of humanity. Let's estimate that these writings total about one book 
for each person living, and that each book contains two million characters (200 
pages with 10 000 characters per page) - that's 10 16 characters, drawn from 
an alphabet of, say, 37 characters. 

The probability that a randomly chosen string of length L matches at one 
point in the collected works of humanity is 1/37 L . So the expected number 
of matches is 10 16 /37 L , which is vanishingly small if L > 16/ log 10 37 ~ 10. 
Because of the redundancy and repetition of humanity's writings, it is possible 
that L ~ 10 is an overestimate. 

So, if you want to write something unique, sit down and compose a string 
of ten characters. But don't write gidnebinzz, because I already thought of 
that string. 

As for a new melody, if we focus on the sequence of notes, ignoring duration 
and stress, and allow leaps of up to an octave at each note, then the number 
of choices per note is 23. The pitch of the first note is arbitrary. The number 
of melodies of length r notes in this rather ugly ensemble of Schonbergian 
tunes is 23 r_1 ; for example, there are 250 000 of length r = 5. Restricting 
the permitted intervals will reduce this figure; including duration and stress 
will increase it again. [If we restrict the permitted intervals to repetitions and 
tones or semitones, the reduction is particularly severe; is this why the melody 
of 'Ode to Joy' sounds so boring?] The number of recorded compositions is 
probably less than a million. If you learn 100 new melodies per week for every 
week of your life then you will have learned 250 000 melodies at age 50. Based 
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on empirical experience of playing the game 'guess that tune', it seems to 
me that whereas many four-note sequences are shared in common between 
melodies, the number of collisions between five-note sequences is rather smaller 

- most famous five-note sequences are unique. 

Solution to exercise 12.11 (p. 201). (a) Let the DNA-binding protein recognize 
a sequence of length L nucleotides. That is, it binds preferentially to that 
DNA sequence, and not to any other pieces of DNA in the whole genome. (In 
reality, the recognized sequence may contain some wildcard characters, e.g., 
the * in TATAA*A, which denotes 'any of A, C, G and T'; so, to be precise, we are 
assuming that the recognized sequence contains L non- wildcard characters.) 

Assuming the rest of the genome is 'random', i.e., that the sequence con- 
sists of random nucleotides A, C, G and T with equal probability - which is 
obviously untrue, but it shouldn't make too much difference to our calculation 

- the chance of there being no other occurrence of the target sequence in the 
whole genome, of length N nucleotides, is roughly 

(1/4) L ) N ~ ex P (-A(l/4) L ), 



(1 



(12.9) 



which is close to one only if 

A4~ L «1, (12.10) 

that is, 

L > log A/ log 4. (12-11) 

Using N = 3 x 10 9 , we require the recognized sequence to be longer than 
Lmm = 16 nucleotides. 

What size of protein does this imply? 

• A weak lower bound can be obtained by assuming that the information 
content of the protein sequence itself is greater than the information 
content of the nucleotide sequence the protein prefers to bind to (which 
we have argued above must be at least 32 bits). This gives a minimum 
protein length of 32/log 2 (20) ~ 7 amino acids. 

• Thinking realistically, the recognition of the DNA sequence by the pro- 
tein presumably involves the protein coming into contact with all sixteen 
nucleotides in the target sequence. If the protein is a monomer, it must 
be big enough that it can simultaneously make contact with sixteen nu- 
cleotides of DNA. One helical turn of DNA containing ten nucleotides 
has a length of 3.4 nm, so a contiguous sequence of sixteen nucleotides 
has a length of 5.4 nm. The diameter of the protein must therefore be 
about 5.4 nm or greater. Egg-white lysozyme is a small globular protein 
with a length of 129 amino acids and a diameter of about 4nm. As- 
suming that volume is proportional to sequence length and that volume 
scales as the cube of the diameter, a protein of diameter 5.4 nm must 
have a sequence of length 2.5 x 129 ~ 324 amino acids. 

(b) If, however, a target sequence consists of a twice-repeated sub-sequence, we 
can get by with a much smaller protein that recognizes only the sub-sequence, 
and that binds to the DNA strongly only if it can form a dimer, both halves 
of which are bound to the recognized sequence. Halving the diameter of the 
protein, we now only need a protein whose length is greater than 324/8 = 40 
amino acids. A protein of length smaller than this cannot by itself serve as 
a regulatory protein specific to one gene, because it's simply too small to be 
able to make a sufficiently specific match - its available surface does not have 
enough information content. 



In guess that tune, one player 
chooses a melody, and sings a 
gradually-increasing number of its 
notes, while the other participants 
try to guess the whole melody. 

The Parsons code is a related hash 
function for melodies: each pair of 
consecutive notes is coded as U 
('up') if the second note is higher 
than the first, R ('repeat') if the 
pitches are equal, and D ('down') 
otherwise. You can find out how 
well this hash function works at 
www . name-this-tune . com. 
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About Chapter 13 



In Chapters 8—11, we established Shannon's noisy-channel coding theorem 
for a general channel with any input and output alphabets. A great deal of 
attention in coding theory focuses on the special case of channels with binary 
inputs. Constraining ourselves to these channels simplifies matters, and leads 
us into an exceptionally rich world, which we will only taste in this book. 

One of the aims of this chapter is to point out a contrast between Shannon's 
aim of achieving reliable communication over a noisy channel and the apparent 
aim of many in the world of coding theory. Many coding theorists take as 
their fundamental problem the task of packing as many spheres as possible, 
with radius as large as possible, into an ./V-dimensional space, with no spheres 
overlapping. Prizes are awarded to people who find packings that squeeze in an 
extra few spheres. While this is a fascinating mathematical topic, we shall see 
that the aim of maximizing the distance between codewords in a code has only 
a tenuous relationship to Shannon's aim of reliable communication. 
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13 



Binary Codes 



We've established Shannon's noisy-channel coding theorem for a general chan- 
nel with any input and output alphabets. A great deal of attention in coding 
theory focuses on the special case of channels with binary inputs, the first 
implicit choice being the binary symmetric channel. 

The optimal decoder for a code, given a binary symmetric channel, finds 
the codeword that is closest to the received vector, closest in Hamming dis- 
tance. The Hamming distance between two binary vectors is the number of 
coordinates in which the two vectors differ. Decoding errors will occur if the 
noise takes us from the transmitted codeword t to a received vector r that 
is closer to some other codeword. The distances between codewords are thus 
relevant to the probability of a decoding error. 

► 13.1 Distance properties of a code 

The distance of a code is the smallest separation between two of its codewords. 

Example 13.1. The (7,4) Hamming code (p. 8) has distance d = 3. All pairs of 
its codewords differ in at least 3 bits. The maximum number of errors 
it can correct is t = 1; in general a code with distance d is |_(<i — 1) /2J- 
error- correcting. 

A more precise term for distance is the minimum distance of the code. The 
distance of a code is often denoted by d or d m i n . 

We'll now constrain our attention to linear codes. In a linear code, all 
codewords have identical distance properties, so we can summarize all the 
distances between the code's codewords by counting the distances from the 
all-zero codeword. 

The weight enumerator function of a code, A(w), is defined to be the 
number of codewords in the code that have weight w. The weight enumerator 
function is also known as the distance distribution of the code. 

Example 13.2. The weight enumerator functions of the (7,4) Hamming code 
and the dodecahedron code are shown in figures 13.1 and 13.2. 

► 13.2 Obsession with distance 

Since the maximum number of errors that a code can guarantee to correct, 
t, is related to its distance d by t = \_(d— 1)/2J, many coding theorists focus 
on the distance of a code, searching for codes of a given size that have the 
biggest possible distance. Much of practical coding theory has focused on 
decoders that give the optimal decoding for all error patterns of weight up to 
the half-distance t of their codes. 



Example: 

The Hamming distance 
between 00001111 
and 11001101 
is 3. 




w 


A(w) 





1 


3 


7 


4 


7 


7 


1 


Total 


16 



Figure 13.1. The graph of the 
(7, 4) Hamming code, and its 
weight enumerator function. 



d = 2t + 1 if d is odd, and 
d = 2t + 2 if d is even. 
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w 


A(w) 





1 


5 


12 


8 


30 


9 


20 


10 


72 


11 


120 


12 


100 


13 


180 


14 


240 


15 


272 


16 


345 


17 


300 


18 


200 


19 


120 


20 


36 


Total 


2048 




5 8 10 15 20 25 30 




Figure 13.2. The graph defining 
the (30, 11) dodecahedron code 
(the circles are the 30 transmitted 
bits and the triangles are the 20 
parity checks, one of which is 
redundant) and the weight 
enumerator function (solid lines). 
The dotted lines show the average 
weight enumerator function of all 
random linear codes with the 
same size of generator matrix, 
which will be computed shortly. 
The lower figure shows the same 
functions on a log scale. 



5 8 10 15 20 25 30 



A bounded-distance decoder is a decoder that returns the closest code- 
word to a received binary vector r if the distance from r to that codeword 
is less than or equal to t; otherwise it returns a failure message. 

The rationale for not trying to decode when more than t errors have occurred 
might be 'we can't guarantee that we can correct more than t errors, so we 
won't bother trying - who would be interested in a decoder that corrects some 
error patterns of weight greater than t, but not others?' This defeatist attitude 
is an example of worst-case-ism, a widespread mental ailment which this book 
is intended to cure. 

The fact is that bounded-distance decoders cannot reach the Shannon limit 
of the binary symmetric channel; only a decoder that often corrects more than 
t errors can do this. The state of the art in error-correcting codes have decoders 
that work way beyond the minimum distance of the code. 



Definitions of good and bad distance properties 



Given a family of codes of increasing blocklength N, and with rates approach- 
ing a limit R > 0, we may be able to put that family in one of the following 
categories, which have some similarities to the categories of 'good' and 'bad' 
codes defined earlier (p. 183): 

A sequence of codes has 'good' distance if d/N tends to a constant 
greater than zero. 

A sequence of codes has 'bad' distance if d/N tends to zero. 

A sequence of codes has 'very bad' distance if d tends to a constant. 

Example 13.3. A low-density generator-matrix code is a linear code whose K x 
N generator matrix G has a small number do of Is per row, regardless 
of how big N is. The minimum distance of such a code is at most do, so 
low-density generator-matrix codes have 'very bad' distance. 




Figure 13.3. The graph of a 
rate-V2 low-density 
generator-matrix code. The 
rightmost M of the transmitted 
bits are each connected to a single 
distinct parity constraint. 



While having large distance is no bad thing, we'll see, later on, why an 
emphasis on distance can be unhealthy. 
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Figure 13.4. Schematic picture of 
part of Hamming space perfectly 
filled by i-sphcres centred on the 
codewords of a perfect code. 



► 13.3 Perfect codes 

A t-sphere (or a sphere of radius t) in Hamming space, centred on a point x, 
is the set of points whose Hamming distance from x is less than or equal to t. 

The (7,4) Hamming code has the beautiful property that if we place 1- 
spheres about each of its 16 codewords, those spheres perfectly fill Hamming 
space without overlapping. As we saw in Chapter 1, every binary vector of 
length 7 is within a distance of t = 1 of exactly one codeword of the Hamming 
code. 

A code is a perfect f-error-correcting code if the set of i-spheres cen- 
tred on the codewords of the code fill the Hamming space without over- 
lapping. (See figure 13.4.) 

Let's recap our cast of characters. The number of codewords is S = 2 K . 
The number of points in the entire Hamming space is 2 N . The number of 
points in a Hamming sphere of radius t is 




(13.1) 



For a code to be perfect with these parameters, we require S times the number 
of points in the f-sphere to equal 2 N : 

2 N (13.2) 
2 N - K . (13.3) 

For a perfect code, the number of noise vectors in one sphere must equal 
the number of possible syndromes. The (7, 4) Hamming code satisfies this 
numerological condition because 



for a perfect code, 2 K 

w=0 
t 

or, equivalently, 



w=0 



(13.4) 
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Figure 13.5. Schematic picture of 
Hamming space not perfectly 
filled by i-spheres centred on the 
codewords of a code. The grey 
regions show points that are at a 
Hamming distance of more than t 
from any codeword. This is a 
misleading picture, as, for any 
code with large t in high 
dimensions, the grey space 
between the spheres takes up 
almost all of Hamming space. 



How happy we would be to use perfect codes 

If there were large numbers of perfect codes to choose from, with a wide 
range of blocklengths and rates, then these would be the perfect solution to 
Shannon's problem. We could communicate over a binary symmetric channel 
with noise level /, for example, by picking a perfect i-error-correcting code 
with blocklength N and t = f*N, where f* = f + S and N and 5 are chosen 
such that the probability that the noise flips more than t bits is satisfactorily 
small. 

However, there are almost no perfect codes. The only nontrivial perfect 
binary codes are 

1 . the Hamming codes, which are perfect codes with t = 1 and blocklength 
N = 2 M — 1, defined below; the rate of a Hamming code approaches 1 
as its blocklength N increases; 

2. the repetition codes of odd blocklength N, which are perfect codes with 
t = (N — l)/2; the rate of repetition codes goes to zero as l/N; and 

3. one remarkable 3-error-correcting code with 2 12 codewords of block- 
length N = 23 known as the binary Golay code. [A second 2-error- 
correcting Golay code of length N = 11 over a ternary alphabet was dis- 
covered by a Finnish football-pool enthusiast called Juhani Virtakallio 
in 1947.] 

There are no other binary perfect codes. Why this shortage of perfect codes? 
Is it because precise numerological coincidences like those satisfied by the 
parameters of the Hamming code (13.4) and the Golay code, 

are rare? Are there plenty of 'almost-perfect' codes for which the i-spheres fill 
almost the whole space? 

No. In fact, the picture of Hamming spheres centred on the codewords 
almost filling Hamming space (figure 13.5) is a misleading one: for most codes, 
whether they are good codes or bad codes, almost all the Hamming space is 
taken up by the space between i-spheres (which is shown in grey in figure 13.5). 

Having established this gloomy picture, we spend a moment filling in the 
properties of the perfect codes mentioned above. 
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00000000000000000000000000000 



nun 



liiiiiiiiini 



uN 



0000001111111111111 



vN 



13 — Binary Codes 
Figure 13.6. Three codewords. 



0000000000 
1111110000 

wN xN 



N 



The Hamming codes 

The (7, 4) Hamming code can be defined as the linear code whose 3x7 parity- 
check matrix contains, as its columns, all the 7 (= 2 3 — 1) non-zero vectors of 
length 3. Since these 7 vectors are all different, any single bit-flip produces a 
distinct syndrome, so all single-bit errors can be detected and corrected. 

We can generalize this code, with M = 3 parity constraints, as follows. The 
Hamming codes are single-error-correcting codes defined by picking a number 
of parity-check constraints, M; the blocklength N is N = 2 M — 1; the parity- 
check matrix contains, as its columns, all the N non-zero vectors of length M 
bits. 

The first few Hamming codes have the following rates: 



Checks, M 


(N,K) 


R = K/N 


2 


(3,1) 


1/3 repetition code R3 


3 


(7,4) 


4/7 (7, 4) Hamming code 


4 


(15,11) 


11/15 


5 


(31,26) 


26/31 


6 


(63,57) 


57/63 



Exercise 13.4. t 2 ' p ' 223 l What is the probability of block error of the (N,K) 
Hamming code to leading order, when the code is used for a binary 
symmetric channel with noise density /? 

► 13.4 Perfectness is unattainable - first proof 

We will show in several ways that useful perfect codes do not exist (here, 
'useful' means 'having large blocklength TV, and rate close neither to nor 1'). 

Shannon proved that, given a binary symmetric channel with any noise 
level /, there exist codes with large blocklength N and rate as close as you 
like to C(f) = 1 — H2(f) that enable communication with arbitrarily small 
error probability. For large N, the number of errors per block will typically be 
about fN, so these codes of Shannon are 'almost-certainly-//V-error-correcting' 
codes. 

Let's pick the special case of a noisy channel with / G (1/3,1/2). Can 
we find a large perfect code that is /7V-error-correcting? Well, let's suppose 
that such a code has been found, and examine just three of its codewords. 
(Remember that the code ought to have rate R ~ 1 — i?2(/), so it should have 
an enormous number (2 NR ) of codewords.) Without loss of generality, we 
choose one of the codewords to be the all-zero codeword and define the other 
two to have overlaps with it as shown in figure 13.6. The second codeword 
differs from the first in a fraction u + v of its coordinates. The third codeword 
differs from the first in a fraction v + w, and from the second in a fraction 
u + w. A fraction x of the coordinates have value zero in all three codewords. 
Now, if the code is //V-error-correcting, its minimum distance must be greater 
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than 2fN, so 

u + v>2f 7 v + w>2f 7 and u + w > 2f. (13.6) 

Summing these three inequalities and dividing by two, we have 

u + v + w >3/. (13.7) 

So if / > 1/3, we can deduce u + v + w > 1, so that x < 0, which is impossible. 
Such a code cannot exist. So the code cannot have three codewords, let alone 
2 NR . 

We conclude that, whereas Shannon proved there are plenty of codes for 
communicating over a binary symmetric channel with / > 1/3, there are no 
perfect codes that can do this. 

We now study a more general argument that indicates that there are no 
large perfect linear codes for general rates (other than and 1). We do this 
by finding the typical distance of a random linear code. 

► 13.5 Weight enumerator function of random linear codes 

Imagine making a code by picking the binary entries in the M x TV parity-check 
matrix H at random. What weight enumerator function should we expect? 

The weight enumerator of one particular code with parity-check matrix H, 
A(w)n, is the number of codewords of weight w, which can be written 



A(w) 



H 



E l[Hx = 0] 



(13.8) 



x: x \—w 



where the sum is over all vectors x whose weight is w and the truth function 
l[Hx = 0] equals one if Hx = and zero otherwise. 
We can find the expected value of A(w), 



(A(w)) 



H 

E E p ( H ) 1 [ Hx =°] 



(13.9) 
(13.10) 



x:|x|=u7 H 



by evaluating the probability that a particular word of weight w > is a 
codeword of the code (averaging over all binary linear codes in our ensemble) . 
By symmetry, this probability depends only on the weight w of the word, not 
on the details of the word. The probability that the entire syndrome Hx is 
zero can be found by multiplying together the probabilities that each of the 
M bits in the syndrome is zero. Each bit z m of the syndrome is a sum (mod 
2) of w random bits, so the probability that z m = is l h. The probability that 
Hx = is thus 

(13.11) 



VP(H)l[Hx = 0] = (V2) M = 2" 



M 



H 



independent of w. 

The expected number of words of weight w (13.10) is given by summing, 
over all words of weight w, the probability that each word is a codeword. The 
number of words of weight w is (^) , so 



N 



1010 

11 
10 11 
0000 
0000 
1100 
10 11 

1 100 
1000 
0100 
0101 
10 11 



10 10 
10 11 
10 11 
10 11 
00 11 
1000 
1110 
10 11 
1110 
0100 

1111 

1010 



1001 
11000 
10010 
11001 
111 

11111 

10 1 
00 1 1 
10 10 
00101 
1111 
10010 



10 10 
110 
110 
110 
10 10 
0000 
0000 
10 1 
0001 
0100 

1111 

0110 



0010110-. 
110 10 00 
110 100 
1001000 
000 100 
10 1110 
1001110 
110 10 10 
11110 1 
110 10 10 
1110 10 
100001 1. 



Figure 13.7. A random binary 
parity-check matrix. 



(A(w)) 



N 



-M 



for any w > 0. 



(13.12) 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 



212 



13 — Binary Codes 



For large N, we can use log Q ~ NH 2 {w/N) and R ~ 1 - M/N to write 

log 2 (^(w)) ~ NH 2 (w/N) - M (13.13) 
~ N[H 2 (w/N) - (1 - 7?)] for any w > 0. (13.14) 

As a concrete example, figure 13.8 shows the expected weight enumerator 
function of a rate- 1/3 random linear code with N = 540 and M = 360. 

Gilbert- Varshamov distance 

For weights w such that H 2 {w/N) < (1 — R), the expectation of A(w) is 
smaller than 1; for weights such that H2(w/N) > (1 — R), the expectation is 
greater than 1. We thus expect, for large N, that the minimum distance of a 
random linear code will be close to the distance day defined by 



H 2 (d GV /N) = (l-R). 



(13.15) 



Definition. This distance, c?gv = NH^ 1 ^ — R), is the Gilbert-Varshainov 
distance for rate R and blocklength N. 

The Gilbert-Varshamov conjecture, widely believed, asserts that (for large 
N) it is not possible to create binary codes with minimum distance significantly 
greater than dcv- 

Definition. The Gilbert— Varshamov rate Rqv is the maximum rate at which 
you can reliably communicate with a bounded-distance decoder (as defined on 
p. 207), assuming that the Gilbert-Varshamov conjecture is true. 

Why sphere-packing is a bad perspective, and an obsession with distance 
is inappropriate 

If one uses a bounded- distance decoder, the maximum tolerable noise level 
will flip a fraction /bd = ^d m i n /N of the bits. So, assuming d m i n is equal to 
the Gilbert distance dcv (13.15), we have: 



H 2 (2f hd ) = (1 - Rg V ). 
i? GV = 1 - H 2 (2f hd ). 



(13.16) 
(13.17) 



Now, here's the crunch: what did Shannon say is achievable? He said the 
maximum possible rate of communication is the capacity, 




Figure 13.8. The expected weight 
enumerator function (A(w)) of a 
random linear code with N = 540 
and M = 360. Lower figure shows 
(A(w)) on a logarithmic scale. 




Figure 13.9. Contrast between 
Shannon's channel capacity C and 
the Gilbert rate Rgv - the 
maximum communication rate 
achievable using a 
bounded-distance decoder, as a 
function of noise level /. For any 
given rate, R, the maximum 
tolerable noise level for Shannon 
is twice as big as the maximum 
tolerable noise level for a 
'worst-case-ist' who uses a 
bounded-distance decoder. 



C 



(13.18) 



So for a given rate R, the maximum tolerable noise level, according to Shannon, 
is given by 

H 2 (f) = (l-R). (13.19) 

Our conclusion: imagine a good code of rate R has been chosen; equations 
(13.16) and (13.19) respectively define the maximum noise levels tolerable by 
a bounded-distance decoder, /bd, and by Shannon's decoder, /. 



/bd = //2. 



(13.20) 



Bounded-distance decoders can only ever cope with half the noise- level that 
Shannon proved is tolerable! 

How does this relate to perfect codes? A code is perfect if there are t- 
spheres around its codewords that fill Hamming space without overlapping. 
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But when a typical random linear code is used to communicate over a bi- 
nary symmetric channel near to the Shannon limit, the typical number of bits 
flipped is j'N, and the minimum distance between codewords is also f'N, or 
a little bigger, if we are a little below the Shannon limit. So the //V-spheres 
around the codewords overlap with each other sufficiently that each sphere 
almost contains the centre of its nearest neighbour! The reason why this 
overlap is not disastrous is because, in high dimensions, the volume associated 
with the overlap, shown shaded in figure 13.10, is a tiny fraction of either 
sphere, so the probability of landing in it is extremely small. 

The moral of the story is that worst-case-ism can be bad for you, halving 
your ability to tolerate noise. You have to be able to decode way beyond the 
minimum distance of a code to get to the Shannon limit! 

Nevertheless, the minimum distance of a code is of interest in practice, 
because, under some conditions, the minimum distance dominates the errors 
made by a code. 




Figure 13.10. Two overlapping 
spheres whose radius is almost as 
big as the distance between their 
centres. 



► 13.6 Berlekamp's bats 

A blind bat lives in a cave. It flies about the centre of the cave, which corre- 
sponds to one codeword, with its typical distance from the centre controlled 
by a friskiness parameter /. (The displacement of the bat from the centre 
corresponds to the noise vector.) The boundaries of the cave are made up of 
stalactites that point in towards the centre of the cave (figure 13.11). Each 
stalactite is analogous to the boundary between the home codeword and an- 
other codeword. The stalactite is like the shaded region in figure 13.10, but 
reshaped to convey the idea that it is a region of very small volume. 

Decoding errors correspond to the bat's intended trajectory passing inside 
a stalactite. Collisions with stalactites at various distances from the centre 
are possible. 

If the friskiness is very small, the bat is usually very close to the centre 
of the cave; collisions will be rare, and when they do occur, they will usually 
involve the stalactites whose tips are closest to the centre point. Similarly, 
under low-noise conditions, decoding errors will be rare, and they will typi- 
cally involve low-weight codewords. Under low-noise conditions, the minimum 
distance of a code is relevant to the (very small) probability of error. 




Figure 13.11. Berlekamp's 
schematic picture of Hamming 
space in the vicinity of a 
codeword. The jagged solid line 
encloses all points to which this 
codeword is the closest. The 
t-sphere around the codeword 
takes up a small fraction of this 
space. 
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If the friskiness is higher, the bat may often make excursions beyond the 
safe distance t where the longest stalactites start, but it will collide most fre- 
quently with more distant stalactites, owing to their greater number. There's 
only a tiny number of stalactites at the minimum distance, so they are rela- 
tively unlikely to cause the errors. Similarly, errors in a real error-correcting 
code depend on the properties of the weight enumerator function. 

At very high friskiness, the bat is always a long way from the centre of 
the cave, and almost all its collisions involve contact with distant stalactites. 
Under these conditions, the bat's collision frequency has nothing to do with 
the distance from the centre to the closest stalactite. 



► 13.7 Concatenation of Hamming codes 



It is instructive to play some more with the concatenation of Hamming codes, 
a concept we first visited in figure 11.6, because we will get insights into the 
notion of good codes and the relevance or otherwise of the minimum distance 
of a code. 

We can create a concatenated code for a binary symmetric channel with 
noise density / by encoding with several Hamming codes in succession. 

The table recaps the key properties of the Hamming codes, indexed by 
number of constraints, M. All the Hamming codes have minimum distance 
d = 3 and can correct one error in N. 



N - 
K -- 

Pb 



2 M _ 1 

N - M 



blocklength 

number of source bits 

probability of block error to leading order 



If we make a product code by concatenating a sequence of C Hamming 
codes with increasing M, we can choose those parameters {M c }^ =1 in such a 
way that the rate of the product code 



Rc = U 



N c -M c 



(13.21) 



tends to a non-zero limit as C increases. For example, if we set Mi = 2, 
M2 = 3, M3 = 4, etc., then the asymptotic rate is 0.093 (figure 13.12). 

The blocklength N is a rapidly-growing function of C, so these codes are 
somewhat impractical. A further weakness of these codes is that their min- 
imum distance is not very good (figure 13.13). Every one of the constituent 
Hamming codes has minimum distance 3, so the minimum distance of the 
Cth product is 3 C . The blocklength N grows faster than 3 C , so the ratio d/N 
tends to zero as C increases. In contrast, for typical random codes, the ratio 
d/N tends to a constant such that H2(d/N) = 1 — 7?. Concatenated Hamming 
codes thus have 'bad' distance. 

Nevertheless, it turns out that this simple sequence of codes yields good 
codes for some channels - but not very good codes (see section 11.4 to recall 
the definitions of the terms 'good' and 'very good'). Rather than prove this 
result, we will simply explore it numerically. 

Figure 13.14 shows the bit error probability pb of the concatenated codes 
assuming that the constituent codes are decoded in sequence, as described 
in section 11.4. [This one-code-at-a-time decoding is suboptimal, as we saw 
there.] The horizontal axis shows the rates of the codes. As the number 
of concatenations increases, the rate drops to 0.093 and the error probability 
drops towards zero. The channel assumed in the figure is the binary symmetric 




Figure 13.12. The rate R of the 
concatenated Hamming code as a 
function of the number of 
concatenations. C . 



1e+25 
1e+20 
1e+15 
1e+10 
100000 

1 



Figure 13.13. The blocklength N c 
(upper curve) and minimum 
distance d c (lower curve) of the 
concatenated Hamming code as a 
function of the number of 
concatenations C. 
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channel with / = 0.0588. This is the highest noise level that can be tolerated 
using this concatenated code. 

The take-home message from this story is distance isn't everything. The 
minimum distance of a code, although widely worshipped by coding theorists, 
is not of fundamental importance to Shannon's mission of achieving reliable 
communication over noisy channels. 

> Exercise 13.5. ^ Prove that there exist families of codes with 'bad' distance 
that are 'very good' codes. 

► 13.8 Distance isn't everything 

Let's get a quantitative feeling for the effect of the minimum distance of a 
code, for the special case of a binary symmetric channel. 

The error probability associated with one low-weight codeword 

Let a binary code have blocklength N and just two codewords, which differ in 
d places. For simplicity, let's assume d is even. What is the error probability 
if this code is used on a binary symmetric channel with noise level /? 

Bit flips matter only in places where the two codewords differ. The error 
probability is dominated by the probability that d/2 of these bits are flipped. 
What happens to the other bits is irrelevant, since the optimal decoder ignores 
them. 



P(block error) 



d 
d/2 



f d,2 {l-f) d > 2 . 



(13.22) 



This error probability associated with a single codeword of weight d is plotted 
in figure 13.15. Using the approximation for the binomial coefficient (1.16), 
we can further approximate 

P(block error) ~ ^/^(l - /) 1/2 ] * (13.23) 

= Wt, (13-24) 

where /3(f) = 2/ 1 / 2 (l — f) 1 ' 2 is called the Bhattacharyya parameter of the 
channel. 

Now, consider a general linear code with distance d. Its block error prob- 
ability must be at least {//2)f d ^ 2 {^ — f) d ^ 2 , independent of the blocklength 
N of the code. For this reason, a sequence of codes of increasing blocklength 
N and constant distance d (i.e., 'very bad' distance) cannot have a block er- 
ror probability that tends to zero, on any binary symmetric channel. If we 
are interested in making superb error-correcting codes with tiny, tiny error 
probability, we might therefore shun codes with bad distance. However, being 
pragmatic, we should look more carefully at figure 13.15. In Chapter 1 we 
argued that codes for disk drives need an error probability smaller than about 
10~ 18 . If the raw error probability in the disk drive is about 0.001, the error 
probability associated with one codeword at distance d = 20 is smaller than 
10~ 24 . If the raw error probability in the disk drive is about 0.01, the error 
probability associated with one codeword at distance d = 30 is smaller than 
10~ 20 . For practical purposes, therefore, it is not essential for a code to have 
good distance. For example, codes of blocklength 10000, known to have many 
codewords of weight 32, can nevertheless correct errors of weight 320 with tiny 
error probability. 
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Figure 13.14. The bit error 
probabilities versus the rates R of 
the concatenated Hamming codes, 
for the binary symmetric channel 
with / = 0.0588. Labels alongside 
the points show the blocklengths, 
N . The solid line shows the 
Shannon limit for this channel. 
The bit error probability drops to 
zero while the rate tends to 0.093, 
so the concatenated Hamming 
codes are a 'good' code family. 
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Figure 13.15. The error 
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I wouldn't want you to think I am recommending the use of codes with 
bad distance; in Chapter 47 we will discuss low-density parity-check codes, my 
favourite codes, which have both excellent performance and good distance. 



► 13.9 The union bound 

The error probability of a code on the binary symmetric channel can be 
bounded in terms of its weight enumerator function by adding up appropriate 
multiples of the error probability associated with a single codeword (13.24): 

P(block error) < ^ A{w){f3{f)} w . (13.25) 

u>>0 

This inequality, which is an example of a union bound, is accurate for low 
noise levels /, but inaccurate for high noise levels, because it overcounts the 
contribution of errors that cause confusion with more than one codeword at a 
time. 

> Exercise 13.6. Poor man's noisy-channel coding theorem. 

Pretending that the union bound (13.25) is accurate, and using the aver- 
age weight enumerator function of a random linear code (13.14) (section 
13.5) as A(w), estimate the maximum rate -Rtjb(/) at which one can 
communicate over a binary symmetric channel. 

Or, to look at it more positively, using the union bound (13.25) as an 
inequality, show that communication at rates up to -Rtjb(/) is possible 
over the binary symmetric channel. 

In the following chapter, by analysing the probability of error of syndrome 
decoding for a binary linear code, and using a union bound, we will prove 
Shannon's noisy-channel coding theorem (for symmetric binary channels), and 
thus show that very good linear codes exist. 



► 13.10 Dual codes 

A concept that has some importance in coding theory, though we will have 
no immediate use for it in this book, is the idea of the dual of a linear error- 
correcting code. 

An (N,K) linear error-correcting code can be thought of as a set of 2 K 
codewords generated by adding together all combinations of K independent 
basis codewords. The generator matrix of the code consists of those K basis 
codewords, conventionally written as row vectors. For example, the (7,4) 
Hamming code's generator matrix (from p. 10) is 



G 



1 1 1 

10 110 

10 111 

1 1 1 



(13.26) 



and its sixteen codewords were displayed in table 1.14 (p. 9). The code- 
words of this code are linear combinations of the four vectors [10 1 1], 
[0100110], [0010111], and [0 1 1 l]. 

An (N, K) code may also be described in terms of an M x N parity-check 
matrix (where M = N — K) as the set of vectors {t} that satisfy 



Ht = 0. 



(13.27) 
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One way of thinking of this equation is that each row of H specifies a vector 
to which t must be orthogonal if it is a codeword. 



The generator matrix specifies K vectors from which all codewords 
can be built, and the parity-check matrix specifies a set of M vectors 
to which all codewords are orthogonal. 

The dual of a code is obtained by exchanging the generator matrix 
and the parity-check matrix. 



Definition. The set of all vectors of length N that are orthogonal to all code- 
words in a code, C, is called the dual of the code, C L . 

If t is orthogonal to hi and I12, then it is also orthogonal to I13 = hi + 112; 
so all codewords are orthogonal to any linear combination of the M rows of 
H. So the set of all linear combinations of the rows of the parity-check matrix 
is the dual code. 

For our Hamming (7,4) code, the parity-check matrix is (from p. 12): 



H = [ P I 3 



1110 10 
1110 10 
10 110 1 



(13.28) 



The dual of the (7, 4) Hamming code Wp^) is the code shown in table 13.16. 



0000000 
0010111 



0101101 
0111010 



1001110 
1011001 



1100011 
1110100 



A possibly unexpected property of this pair of codes is that the dual, 



n 



, 7i y is contained within the code 7^(7,4) itself: every word in the dual code 
is a codeword of the original (7, 4) Hamming code. This relationship can be 
written using set notation: 



W (7,4) C W (7,4)- 



(13.29) 



The possibility that the set of dual vectors can overlap the set of codeword 
vectors is counterintuitive if we think of the vectors as real vectors - how can 
a vector be orthogonal to itself? But when we work in modulo-two arithmetic, 
many non-zero vectors are indeed orthogonal to themselves! 

> Exercise 13.7.^' P' 223 l Give a simple rule that distinguishes whether a binary 
vector is orthogonal to itself, as is each of the three vectors [l 1 10 10 0], 
[0111010], and [l 1 1 00 l]. 



Table 13.16. The eight codewords 
of the dual of the (7, 4) Hamming 
code. [Compare with table 1.14, 
p.9.] 



Some more duals 

In general, if a code has a systematic generator matrix, 

G= [Ik|P t ], 

where P is a K x M matrix, then its parity-check matrix is 



(13.30) 



H = [P|I M ] • 



(13.31) 
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Example 13.8. The repetition code R3 has generator matrix 

G= [ 1 1 1 ]; 

its parity-check matrix is 



H 



110 
1 1 



The two codewords are [ill] and [0 0]. 
The dual code has generator matrix 



H 



110 
10 1 



(13.32) 



(13.33) 



(13.34) 



or equivalently, modifying G into systematic form by row additions, 



1 
1 



(13.35) 



We call this dual code the simple parity code P3; it is the code with one 
parity-check bit, which is equal to the sum of the two source bits. The 
dual code's four codewords are [1 1 0], [l l], [0 0], and [Oil]. 

In this case, the only vector common to the code and the dual is the 
all-zero codeword. 



Goodness of duals 

If a sequence of codes is 'good', are their duals good too? Examples can be 
constructed of all cases: good codes with good duals (random linear codes); 
bad codes with bad duals; and good codes with bad duals. The last category 
is especially important: many state-of-the-art codes have the property that 
their duals are bad. The classic example is the low-density parity-check code, 
whose dual is a low-density generator-matrix code. 

> Exercise 13.9. ^ Show that low-density generator-matrix codes are bad. A 
family of low-density generator-matrix codes is defined by two param- 
eters j, k, which are the column weight and row weight of all rows and 
columns respectively of G. These weights are fixed, independent of N; 
for example, (j,k) = (3,6). [Hint: show that the code has low-weight 
codewords, then use the argument from p. 215.] 

Exercise 13.10. ^ Show that low-density parity-check codes are good, and have 
good distance. (For solutions, see Gallager (1963) and MacKay (1999b).) 



Self- dual codes 

The (7, 4) Hamming code had the property that the dual was contained in the 
code itself. A code is self-orthogonal if it is contained in its dual. For example, 
the dual of the (7,4) Hamming code is a self-orthogonal code. One way of 
seeing this is that the overlap between any pair of rows of H is even. Codes that 
contain their duals are important in quantum error-correction (Calderbank 
and Shor, 1996). 

It is intriguing, though not necessarily useful, to look at codes that are 
self-dual. A code C is self-dual if the dual of the code is identical to the code. 

C J ~=C. (13.36) 



Some properties of self-dual codes can be deduced: 
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1. If a code is self-dual, then its generator matrix is also a parity-check 
matrix for the code. 

2. Self-dual codes have rate 1/2, i.e., M = K = N/2. 

3. All codewords have even weight. 

> Exercise 13.11. ^' P- 223 l What property must the matrix P satisfy, if the code 
with generator matrix G = [Ik\ P t ] is self-dual? 



Examples of self-dual codes 

1. The repetition code R2 is a simple example of a self-dual code. 

G = H = [ 1 1 ] . (13.37) 

2. The smallest non-trivial self-dual code is the following (8,4) code. 



[ I 4 I P T 



10 
10 
10 
1 



111 
10 11 
110 1 
1110 



(13.38) 



> Exercise 13.12.^' P- 223 l Find the relationship of the above (8,4) code to the 
(7, 4) Hamming code. 



Duals and graphs 

Let a code be represented by a graph in which there are nodes of two types, 
parity-check constraints and equality constraints, joined by edges which rep- 
resent the bits of the code (not all of which need be transmitted). 

The dual code's graph is obtained by replacing all parity-check nodes by 
equality nodes and vice versa. This type of graph is called a normal graph by 
Forney (2001). 



Further reading 

Duals are important in coding theory because functions involving a code (such 
as the posterior distribution over codewords) can be transformed by a Fourier 
transform into functions over the dual code. For an accessible introduction 
to Fourier analysis on finite groups, see Terras (1999). See also MacWilliams 
and Sloane (1977). 



► 13.11 Generalizing perfectness to other channels 



Having given up on the search for perfect codes for the binary symmetric 
channel, we could console ourselves by changing channel. We could call a 
code 'a perfect w-error-correcting code for the binary erasure channel' if it 
can restore any u erased bits, and never more than u. Rather than using the 
word perfect, however, the conventional term for such a code is a 'maximum 
distance separable code', or MDS code. 

As we already noted in exercise 11.10 (p. 190), the (7,4) Hamming code is 
not an MDS code. It can recover some sets of 3 erased bits, but not all. If 
any 3 bits corresponding to a codeword of weight 3 are erased, then one bit of 
information is unrecoverable. This is why the (7, 4) code is a poor choice for 
a RAID system. 



In a perfect u-error-correcting 
code for the binary erasure 
channel, the number of redundant 
bits must be N — K = u. 
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A tiny example of a maximum distance separable code is the simple parity- 
check code P3 whose parity-check matrix is H = [ill]. This code has 4 
codewords, all of which have even parity. All codewords are separated by 
a distance of 2. Any single erased bit can be restored by setting it to the 
parity of the other two bits. The repetition codes are also maximum distance 
separable codes. 

Exercise 13.13. t 5 ' p ' 224 l Can you make an (TV, TV) code, with M = N - K 
parity symbols, for a g-ary erasure channel, such that the decoder can 
recover the codeword when any M symbols are erased in a block of TV? 
[Example: for the channel with q = 4 symbols there is an (TV, K) = (5, 2) 
code which can correct any M = 3 erasures.] 

For the g-ary erasure channel with q > 2, there are large numbers of MDS 
codes, of which the Reed-Solomon codes are the most famous and most widely 
used. As long as the field size q is bigger than the blocklength TV, MDS block 
codes of any rate can be found. (For further reading, see Lin and Costello 
(1983).) 

► 13.12 Summary 

Shannon's codes for the binary symmetric channel can almost always correct 
/TV errors, but they are not /TV-error-correcting codes. 

Reasons why the distance of a code has little relevance 

1. The Shannon limit shows that the best codes must be able to cope with 
a noise level twice as big as the maximum noise level for a bounded- 
distance decoder. 

2. When the binary symmetric channel has / > 1/4, no code with a 
bounded-distance decoder can communicate at all; but Shannon says 
good codes exist for such channels. 

3. Concatenation shows that we can get good performance even if the dis- 
tance is bad. 

The whole weight enumerator function is relevant to the question of 
whether a code is a good code. 

The relationship between good codes and distance properties is discussed 
further in exercise 13.14 (p. 220). 

► 13.13 Further exercises 

Exercise 13. 14. P- 224 l A codeword t is selected from a linear (TV, TV) code 
C, and it is transmitted over a noisy channel; the received signal is y. 
We assume that the channel is a memoryless channel such as a Gaus- 
sian channel. Given an assumed channel model P(y |t), there are two 
decoding problems. 

The codeword decoding problem is the task of inferring which 
codeword t was transmitted given the received signal. 

The bitwise decoding problem is the task of inferring for each 
transmitted bit t n how likely it is that that bit was a one rather 
than a zero. 
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Consider optimal decoders for these two decoding problems. Prove that 
the probability of error of the optimal bitwise-decoder is closely related 
to the probability of error of the optimal codeword-decoder, by proving 
the following theorem. 

Theorem 13.1 If a binary linear code has minimum distance d m \ n , 
then, for any given channel, the codeword bit error probability of the 
optimal bitwise decoder, pt>, and the block error probability of the maxi- 
mum likelihood decoder, p-Q, are related by: 



^ .1 ^min 

Pb > Pb > 2~~N~ PB ' 



(13.39) 



. " Exercise 13.15. [1] What are the minimum distances of the (15, 11) Hamming 
»j£ code and the (31, 26) Hamming code? 

> Exercise 13.16.^ Let A(w) be the average weight enumerator function of a 
rate-1/3 random linear code with N = 540 and M = 360. Estimate, 
from first principles, the value of A(w) at w = 1. 

Exercise 13.17. A code with minimum distance greater than dev- A rather 
nice (15,5) code is generated by this generator matrix, which is based 
on measuring the parities of all the (J) = 10 triplets of source bits: 



• 1 1 1 • • 1 1 • 1 

■ ■ 1 1 1 1 ■ 1 1 ■ 

1 • • 1 1 • 1 • 1 1 

1 1 • • 1 1 • 1 • 1 



1111 



1 1 



(13.40) 



Find the minimum distance and weight enumerator function of this code. 

Exercise 13.18. Find the minimum distance of the 'pentagonful' low- 
density parity-check code whose parity-check matrix is 



H 



1— 1 ■ ■ ■ 1— 1 
■ 1—1 1—1 
■ 1—1 1—1 
■ 1—1 1—1 


1 . . . . 
. 1 . . . 
. . 1 . . 
. . . 1 . 
.... 1 






. . . 1 . 
. 1 . . . 
.... 1 
. . 1 . . 


1— 1 ■ ■ ■ 1— 1 
■ 1—1 1—1 
■ 1—1 1—1 
■ 1— 1 1— 1 
1—1 1—1 



(13.41) 



Show that nine of the ten rows are independent, so the code has param- 
eters N = 15, K = 6. Using a computer, find its weight enumerator 
function. 

> Exercise 13.19. ^ 3C ^ Replicate the calculations used to produce figure 13.12. 
Check the assertion that the highest noise level that's correctable is 
0.0588. Explore alternative concatenated sequences of codes. Can you 
find a better sequence of concatenated codes - better in the sense that it 
has either higher asymptotic rate 7? or can tolerate a higher noise level 
/? 




Figure 13.17. The graph of the 
pentagonful low-density 
parity-check code with 15 bit 
nodes (circles) and 10 
parity-check nodes (triangles). 
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Exercise 13.20. ^' p ' 22f >] Investigate the possibility of achieving the Shannon 
limit with linear block codes, using the following counting argument. 
Assume a linear code of large blocklength N and rate R = K/N . The 
code's parity-check matrix H has M = N — K rows. Assume that the 
code's optimal decoder, which solves the syndrome decoding problem 
Hn = z, allows reliable communication over a binary symmetric channel 
with flip probability /. 

How many 'typical' noise vectors n are there? 

Roughly how many distinct syndromes z are there? 

Since n is reliably deduced from z by the optimal decoder, the number 
of syndromes must be greater than or equal to the number of typical 
noise vectors. What does this tell you about the largest possible value 
of rate R for a given /? 

t> Exercise 13.21. ^ Linear binary codes use the input symbols and 1 with 
equal probability, implicitly treating the channel as a symmetric chan- 
nel. Investigate how much loss in communication rate is caused by this 
assumption, if in fact the channel is a highly asymmetric channel. Take 
as an example a Z-channel. How much smaller is the maximum possible 
rate of communication using symmetric inputs than the capacity of the 
channel? [Answer: about 6%.] 

Exercise 13.22.^ Show that codes with 'very bad' distance are 'bad' codes, as 
defined in section 11.4 (p. 183). 

Exercise 13.23.^ One linear code can be obtained from another by punctur- 
ing. Puncturing means taking each codeword and deleting a defined set 
of bits. Puncturing turns an (N,K) code into an (N',K) code, where 
N' < N. 

Another way to make new linear codes from old is shortening. Shortening 
means constraining a defined set of bits to be zero, and then deleting 
them from the codewords. Typically if we shorten by one bit, half of the 
code's codewords are lost. Shortening typically turns an (N, K) code 
into an (N' , K') code, where N - N' = K - K' . 

Another way to make a new linear code from two old ones is to make 
the intersection of the two codes: a codeword is only retained in the new 
code if it is present in both of the two old codes. 

Discuss the effect on a code's distance properties of puncturing, short- 
ening, and intersection. Is it possible to turn a code family with bad 
distance into a code family with good distance, or vice versa, by each of 
these three manipulations? 

Exercise 13.24. [5 ' p 226] Todd Ebert's 'hat puzzle'. 

Three players enter a room and a red or blue hat is placed on each 
person's head. The colour of each hat is determined by a coin toss, with 
the outcome of one coin toss having no effect on the others. Each person 
can see the other players' hats but not his own. 

No communication of any sort is allowed, except for an initial strategy 
session before the group enters the room. Once they have had a chance 
to look at the other hats, the players must simultaneously guess their 
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own hat's colour or pass. The group shares a $3 million prize if at least 
one player guesses correctly and no players guess incorrectly. 

The same game can be played with any number of players. The general 
problem is to find a strategy for the group that maximizes its chances of 
winning the prize. Find the best strategies for groups of size three and 
seven. 

[Hint: when you've done three and seven, you might be able to solve 
fifteen.] 

Exercise 13.25. ^ Estimate how many binary low-density parity-check codes 
have self-orthogonal duals. [Note that we don't expect a huge number, 
since almost all low-density parity-check codes are 'good', but a low- 
density parity-check code that contains its dual must be 'bad'.] 

Exercise 13.26. [2C] In fi gure 13.15 we plotted the error probability associated 
with a single codeword of weight d as a function of the noise level / of a 
binary symmetric channel. Make an equivalent plot for the case of the 
Gaussian channel, showing the error probability associated with a single 
codeword of weight d as a function of the rate-compensated signal-to- 
noise ratio E^/Nq. Because E^/Nq depends on the rate, you have to 
choose a code rate. Choose R = 1/2, 2/3, 3/4, or 5/6. 



If you already know the hat 
puzzle, you could try the 'Scottish 
version' of the rules in which the 
prize is only awarded to the group 
if they all guess correctly. 

In the 'Reformed Scottish 
version', all the players must 
guess correctly, and there are two 
rounds of guessing. Those players 
who guess during round one leave 
the room. The remaining players 
must guess in round two. What 
strategy should the team adopt to 
maximize their chance of winning? 



► 13.14 Solutions 

Solution to exercise 13.4 (p. 210). The probability of block error to leading 
order is p B = |p(^)/ 2 . 

Solution to exercise 13.7 (p. 217). A binary vector is perpendicular to itself if 
it has even weight, i.e., an even number of Is. 

Solution to exercise 13.11 (p. 219). The self-dual code has two equivalent 
parity-check matrices, Hi = G = [I/f|P T ] and H2 = [P|Ijr]; these must be 
equivalent to each other through row additions, that is, there is a matrix U 
such that UH2 = Hi, so 



[up|ui*] = [Ik|p t ]. 



(13.42) 



From the right-hand sides of this equation, we have U = P T , so the left-hand 
sides become: 

P T P = I K . (13.43) 

Thus if a code with generator matrix G = [I K |P T ] is self-dual then P is an 
orthogonal matrix, modulo 2, and vice versa. 

Solution to exercise 13.12 (p. 219). The (8,4) and (7,4) codes are intimately 
related. The (8, 4) code, whose parity-check matrix is 



H = [ P I I 4 ] 



111 
10 11 
110 1 
1110 



10 
10 
10 
1 



(13.44) 



is obtained by (a) appending an extra parity-check bit which is the parity of 
all seven bits of the (7,4) Hamming code; and (b) reordering the first four 
bits. 
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Solution to exercise 13.13 (p. 220). If an (N, K) code, with M = N-K parity 
symbols, has the property that the decoder can recover the codeword when any 
M symbols are erased in a block of N, then the code is said to be maximum 
distance separable (MDS). 

No MDS binary codes exist, apart from the repetition codes and simple 
parity codes. For q > 2, some MDS codes can be found. 

As a simple example, here is a (9, 2) code for the 8-ary erasure channel. 
The code is defined in terms of the multiplication and addition rules of GF{8), 
which are given in Appendix C.l. The elements of the input alphabet are 
{0, 1, A, B, C, D, E, F} and the generator matrix of the code is 



G 



1 1 A B C D E F 
11111111 



The resulting 64 codewords are: 



ooooooooo 

101ABCDEF 
A0ACEB1FD 
B0BEDFC1A 
C0CBFEAD1 
D0D1CAFBE 



011111111 
110BADCFE 
A1BDFA0EC 
B1AFCED0B 
C1DAEFBC0 
D1C0DBEAF 



E0EF1DBAC E1FE0CABD 
F0FDA1ECB F1ECB0FDA 



0AAAAAAAA 
1AB01EFCD 
AA0EC1BDF 
BA1CFDEB0 
CAE1DC0FB 
DAFBE0D1C 
EACDBF10E 
FADF0BCE1 



0BBBBBBBB 
1BA10FEDC 
AB1FD0ACE 
BB0DECFA1 
CBF0CD1EA 
DBEAF1C0D 
EBDCAE01F 
FBCE1ADF0 



0CCCCCCCC 
1CDEF01AB 
ACE0AFDB1 
BCFA1B0DE 
CC0FBAE1D 
DC1D0EBFA 
ECABD1FE0 
FCB1EDA0F 



(13.45) 



0DDDDDDDD 
1DCFE10BA 
ADF1BECA0 
BDEB0A1CF 
CD1EABF0C 
DD0C1FAEB 
EDBAC0EF1 
FDA0FCB1E 




0FFFFFFFF 
1FEDCBA10 
AFDB1CE0A 
BFC1A0BED 
CFBC01DAE 
DFAEBD0C1 
EF10EACDB 
FF0ADE1BC 



Solution to exercise 13.14 (p. 220). Quick, rough proof of the theorem. Let x 

denote the difference between the reconstructed codeword and the transmitted 
codeword. For any given channel output r, there is a posterior distribution 
over x. This posterior distribution is positive only on vectors x belonging 
to the code; the sums that follow are over codewords x. The block error 
probability is: 

PB = ^^(x|r). (13.46) 

x#0 

The average bit error probability, averaging over all bits in the codeword, is: 



Pb 



£P(x|r) 

x^0 



w(x) 



(13.47) 



where w(x) is the weight of codeword x. Now the weights of the non-zero 
codewords satisfy 

(13.48) 



> w(x) > d min 



N 



N 



Substituting the inequalities (13.48) into the definitions (13.46, 13.47), we ob- 
tain: 

. . ^min /., A ~\ 

PB>Pb> -rrPB, (13.49) 



which is a factor of two stronger, on the right, than the stated result (13.39). 
In making the proof watertight, I have weakened the result a little. 

Careful proof. The theorem relates the performance of the optimal block de- 
coding algorithm and the optimal bitwise decoding algorithm. 

We introduce another pair of decoding algorithms, called the block- 
guessing decoder and the bit-guessing decoder. The idea is that these two 
algorithms are similar to the optimal block decoder and the optimal bitwise 
decoder, but lend themselves more easily to analysis. 

We now define these decoders. Let x denote the inferred codeword. For 
any given code: 
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13.14: Solutions 

The optimal block decoder returns the codeword x that maximizes the 
posterior probability P(x | r), which is proportional to the likelihood 
P(r |x). 

The probability of error of this decoder is called pb- 

The optimal bit decoder returns for each of the N bits, x n , the 

value of a that maximizes the posterior probability P(x n = a | r) = 
£ x P(x|r)l(x n = o]. 

The probability of error of this decoder is called p-^- 

The block-guessing decoder returns a random codeword x with probabil- 
ity distribution given by the posterior probability P(x | r). 

The probability of error of this decoder is called pg. 

The bit-guessing decoder returns for each of the N bits, x n , a random bit 
from the probability distribution P(x n = a | r). 

The probability of error of this decoder is called p^. 

The theorem states that the optimal bit error probability pb is bounded above 
by pb and below by a given multiple of pb (13.39). 

The left-hand inequality in (13.39) is trivially true - if a block is correct, all 
its constituent bits are correct; so if the optimal block decoder outperformed 
the optimal bit decoder, we could make a better bit decoder from the block 
decoder. 

We prove the right-hand inequality by establishing that: 

(a) the bit-guessing decoder is nearly as good as the optimal bit decoder: 

< 2 p h . (13.50) 

(b) the bit-guessing decoder's error probability is related to the block- 
guessing decoder's by 



G \ ^min G i ■ > - i , 

Pb > —Pb- (!3-51) 



Then since pg > pb, we have 



1 G 1 <i m i n G 1 d mm 
Ph > 2 Ph ~ 2~ PB ~ 2— pB - (13 - 52) 

We now prove the two lemmas. 

Near-optimality of guessing: Consider first the case of a single bit, with posterior 
probability {po,pi}- The optimal bit decoder has probability of error 

P optimal = min(p , Pl ). (13.53) 

The guessing decoder picks from and 1. The truth is also distributed with 
the same probability. The probability that the guesser and the truth match is 
Pq+Pi', the probability that they mismatch is the guessing error probability, 

pguess = 2popi < 2min(p ,Pi) = 2P optimal . (13.54) 

Since p^ is the average of many such error probabilities, ps uess ; and pb is the 
average of the corresponding optimal error probabilities, p°P tlma! ; we obtain 
the desired relationship (13.50) between p^j and p^,. □ 
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226 13 — Binary Codes 

Relationship between bit error probability and block error probability: The bit- 
guessing and block-guessing decoders can be combined in a single system: we 
can draw a sample x n from the marginal distribution P(x n | r) by drawing 
a sample (x n ,x) from the joint distribution P(x n ,x\r), then discarding the 
value of x. 

We can distinguish between two cases: the discarded value of x is the 
correct codeword, or not. The probability of bit error for the bit- guessing 
decoder can then be written as a sum of two terms: 

= -P( x correct)P(bit error | x correct) 

+ _P(x incorrect)P(bit error | x incorrect) 
= + pgP(bit error | x incorrect). 

Now, whenever the guessed x is incorrect, the true x must differ from it in at 
least d bits, so the probability of bit error in these cases is at least d/N. So 

QED. □ 

Solution to exercise 13.20 (p. 222). The number of 'typical' noise vectors n 
is roughly 2 NH2 ^>. The number of distinct syndromes z is 2 M . So reliable 
communication implies 

M > NH 2 {f), (13.55) 
or, in terms of the rate R = 1 — M/N, 

R<l-H 2 (f), (13.56) 

a bound which agrees precisely with the capacity of the channel. 
This argument is turned into a proof in the following chapter. 

Solution to exercise 13.24 (p. 222). In the three-player case, it is possible for 
the group to win three-quarters of the time. 

Three-quarters of the time, two of the players will have hats of the same 
colour and the third player's hat will be the opposite colour. The group can 
win every time this happens by using the following strategy. Each player looks 
at the other two players' hats. If the two hats are different colours, he passes. 
If they are the same colour, the player guesses his own hat is the opposite 
colour. 

This way, every time the hat colours are distributed two and one, one 
player will guess correctly and the others will pass, and the group will win the 
game. When all the hats are the same colour, however, all three players will 
guess incorrectly and the group will lose. 

When any particular player guesses a colour, it is true that there is only a 
50:50 chance that their guess is right. The reason that the group wins 75% of 
the time is that their strategy ensures that when players are guessing wrong, 
a great many are guessing wrong. 

For larger numbers of players, the aim is to ensure that most of the time 
no one is wrong and occasionally everyone is wrong at once. In the game with 
7 players, there is a strategy for which the group wins 7 out of every 8 times 
they play. In the game with 15 players, the group can win 15 out of 16 times. 
If you have not figured out these winning strategies for teams of 7 and 15, 
I recommend thinking about the solution to the three-player game in terms 
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13.14: Solutions 

of the locations of the winning and losing states on the three-dimensional 
hypercube, then thinking laterally. 

If the number of players, TV", is 2 r — 1, the optimal strategy can be defined 
using a Hamming code of length N, and the probability of winning the prize 
is N/(N + 1). Each player is identified with a number n € 1 . . . N. The two 
colours are mapped onto and 1. Any state of their hats can be viewed as a 
received vector out of a binary channel. A random binary vector of length N 
is either a codeword of the Hamming code, with probability l/(N +1), or it 
differs in exactly one bit from a codeword. Each player looks at all the other 
bits and considers whether his bit can be set to a colour such that the state is 
a codeword (which can be deduced using the decoder of the Hamming code). 
If it can, then the player guesses that his hat is the other colour. If the state is 
actually a codeword, all players will guess and will guess wrong. If the state is 
a non-codeword, only one player will guess, and his guess will be correct. It's 
quite easy to train seven players to follow the optimal strategy if the cyclic 
representation of the (7,4) Hamming code is used (p. 19). 
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About Chapter 14 



In this chapter we will draw together several ideas that we've encountered 
so far in one nice short proof. We will simultaneously prove both Shannon's 
noisy-channel coding theorem (for symmetric binary channels) and his source 
coding theorem (for binary sources). While this proof has connections to many 
preceding chapters in the book, it's not essential to have read them all. 

On the noisy-channel coding side, our proof will be more constructive than 
the proof given in Chapter 10; there, we proved that almost any random code 
is 'very good'. Here we will show that almost any linear code is very good. We 
will make use of the idea of typical sets (Chapters 4 and 10), and we'll borrow 
from the previous chapter's calculation of the weight enumerator function of 
random linear codes (section 13.5). 

On the source coding side, our proof will show that random linear hash 
functions can be used for compression of compressible binary sources, thus 
giving a link to Chapter 12. 
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14 



Very Good Linear Codes Exist 



In this chapter we'll use a single calculation to prove simultaneously the source 
coding theorem and the noisy-channel coding theorem for the binary symmet- 
ric channel. 

Incidentally this proof works for much more general channel models, not 
only the binary symmetric channel. For example, the proof can be reworked 
for channels with non-binary outputs, for time- varying channels and for chan- 
nels with memory, as long as they have binary inputs satisfying a symmetry 
property, cf. section 10.6. 



► 14.1 A simultaneous proof of the source coding and noisy-channel 
coding theorems 

We consider a linear error-correcting code with binary parity-check matrix H. 
The matrix has M rows and N columns. Later in the proof we will increase 
N and M, keeping M oc N. The rate of the code satisfies 



R > 1 



M 
AT 



(14.1) 



If all the rows of H are independent then this is an equality, R = 1 — M/N. 
In what follows, we'll assume the equality holds. Eager readers may work out 
the expected rank of a random binary matrix H (it's very close to M) and 
pursue the effect that the difference (M — rank) has on the rest of this proof 
(it's negligible). 

A codeword t is selected, satisfying 



Ht = 0mod2. 



(14.2) 



and a binary symmetric channel adds noise x, giving the received signal 



r = t + xmod 2. 



(14.3) 



In this chapter x denotes the 
noise added by the channel, not 
the input to the channel. 



The receiver aims to infer both t and x from r using a syndrome-decoding 
approach. Syndrome decoding was first introduced in section 1.2 (p. 10 and 
11). The receiver computes the syndrome 



Hrmod2 = Ht + Hxmod2 = Hxmod2. 



(14.4) 



The syndrome only depends on the noise x, and the decoding problem is to 
find the most probable x that satisfies 



Hx = zmod2. 



(14.5) 
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14 — Very Good Linear Codes Exist 



This best estimate for the noise vector, x, is then subtracted from r to give the 
best guess for t. Our aim is to show that, as long as R < 1 — H(X) = 1 — i?2(/), 
where / is the flip probability of the binary symmetric channel, the optimal 
decoder for this syndrome-decoding problem has vanishing probability of error, 
as N increases, for random H. 

We prove this result by studying a sub-optimal strategy for solving the 
decoding problem. Neither the optimal decoder nor this typical-set decoder 
would be easy to implement, but the typical-set decoder is easier to analyze. 
The typical-set decoder examines the typical set T of noise vectors, the set 
of noise vectors x' that satisfy log Vp(x') ~ NH(X), checking to see if any of 
those typical vectors x' satisfies the observed syndrome, 



Hx' 



z. 



(14.6) 



If exactly one typical vector x' does so, the typical set decoder reports that 
vector as the hypothesized noise vector. If no typical vector matches the 
observed syndrome, or more than one does, then the typical set decoder reports 
an error. 

The probability of error of the typical-set decoder, for a given matrix H, 
can be written as a sum of two terms, 



-Pts|h — 



TS|H' 



(14.7) 



where PV' is the probability that the true noise vector x is itself not typical, 
and frpg^jj is the probability that the true x is typical and at least one other 
typical vector clashes with it. The first probability vanishes as N increases, as 
we proved when we first studied typical sets (Chapter 4). We concentrate on 
the second probability. To recap, we're imagining a true noise vector, x; and 
if any of the typical noise vectors x', different from x, satisfies H(x' — x) = 0, 
then we have an error. We use the truth function 



We'll leave out the es and /?s that 
make a typical-set definition 
rigorous. Enthusiasts are 
encouraged to revisit section 4.4 
and put these details into this 
proof. 



1 [H(x' 







(14.; 



whose value is one if the statement H(x' — x) = is true and zero otherwise. 
We can bound the number of type II errors made when the noise is x thus: 



[Number of errors given x and H] < 1 [H(x' - x) = 0] . (14.9) 



E 

/. x' 6 T 
' x'^ x 



The number of errors is either zero or one; the sum on the right-hand side 
may exceed one, in cases where several typical noise vectors have the same 
syndrome. 

We can now write down the probability of a type-II error by averaging over 
p ts\u < E p ( x ) E 1 [ R ( x ' " x ) = °] • ( 1410 ) 



• x'# X 



Now, we will find the average of this probability of type-II error over all linear 
codes by averaging over H. By showing that the average probability of type-II 
error vanishes, we will thus show that there exist linear codes with vanishing 
error probability, indeed, that almost all linear codes are very good. 

We denote averaging over all binary matrices H by (. . .) H . The average 
probability of type-II error is 



TS 



E P ( H ) P §H = (P, 



,{II) 

TSIH 



H 



H 



(14.11) 



Equation (14.9) is a union bound. 
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14.2: Data compression by linear hash codes 



= (E P ( X ) E l[H(x'-x) = 0]\ (14.12) 
^P(x) £ <l[H(x'-x) = 0]) H . (14.13) 



A ' x V X 



Now, the quantity (l[H(x' — x) = 0]) H already cropped up when we were 
calculating the expected weight enumerator function of random linear codes 
(section 13.5): for any non-zero binary vector v, the probability that Hv = 0, 
averaging over all matrices H, is 2~ M . So 



p(n) 



5>(x))(|T|-l)2- M (14.14) 

vxer / 



T\2- M , (14.15) 



where \T\ denotes the size of the typical set. As you will recall from Chapter 
4, there are roug hly 2 NH ^ noise vectors in the typical set. So 

P&V < 2 NH ^ x h- M . (14.16) 

This bound on the probability of error either vanishes or grows exponentially 
as N increases (remembering that we are keeping M proportional to N as N 
increases). It vanishes if 

H(X) < M/N. (14.17) 

Substituting R = 1 — M/N, we have thus established the noisy-channel coding 
theorem for the binary symmetric channel: very good linear codes exist for 
any rate R satisfying 

R<1-H(X), (14.18) 
where H(X) is the entropy of the channel noise, per bit. □ 

Exercise 14. l} 3 ^ Redo the proof for a more general channel. 



► 14.2 Data compression by linear hash codes 

The decoding game we have just played can also be viewed as an uncompres- 
sion game. The world produces a binary noise vector x from a source P(x). 
The noise has redundancy (if the flip probability is not 0.5). We compress it 
with a linear compressor that maps the TV-bit input x (the noise) to the M-bit 
output z (the syndrome). Our uncompression task is to recover the input x 
from the output z. The rate of the compressor is 

^compressor = M/N. (14.19) 

[We don't care about the possibility of linear redundancies in our definition 
of the rate, here.] The result that we just found, that the decoding problem 
can be solved, for almost any H, with vanishing error probability, as long as 
H{X) < M/N, thus instantly proves a source coding theorem: 

Given a binary source X of entropy H(X), and a required com- 
pressed rate R > H(X), there exists a linear compressor x — > z = 
Hx mod 2 having rate M/N equal to that required rate R, and an 
associated uncompressor, that is virtually lossless. 

This theorem is true not only for a source of independent identically dis- 
tributed symbols but also for any source for which a typical set can be de- 
fined: sources with memory, and time- varying sources, for example; all that's 
required is that the source be ergodic. 
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232 14 — Very Good Linear Codes Exist 



Notes 

This method for proving that codes are good can be applied to other linear 
codes, such as low-density parity-check codes (MacKay, 1999b; Aji et al., 2000). 
For each code we need an approximation of its expected weight enumerator 
function. 
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15 



Further Exercises on Information Theory 



The most exciting exercises, which will introduce you to further ideas in in- 
formation theory, are towards the end of this chapter. 

Refresher exercises on source coding and noisy channels 

> Exercise 15.1. ^ Let X be an ensemble with Ax = {0,1} and Vx = 

{0.995,0.005}. Consider source coding using the block coding of X 100 
where every x € X 100 containing 3 or fewer Is is assigned a distinct 
codeword, while the other xs are ignored. 

(a) If the assigned codewords are all of the same length, find the min- 
imum length required to provide the above set with distinct code- 
words. 

(b) Calculate the probability of getting an x that will be ignored. 

> Exercise 15.2. [2] Let X be an ensemble with V x = {0.1,0.2,0.3,0.4}. The en- 

semble is encoded using the symbol code C = {0001, 001, 01, 1}. Consider 
the codeword corresponding to x G X N , where N is large. 



(a) Compute the entropy of the fourth bit of transmission. 

(b) Compute the conditional entropy of the fourth bit given the third 
bit. 

(c) Estimate the entropy of the hundredth bit. 

(d) Estimate the conditional entropy of the hundredth bit given the 
ninety- ninth bit. 



Exercise 15.3. [ ' Two fair dice are rolled by Alice and the sum is recorded. 
Bob's task is to ask a sequence of questions with yes/no answers to find 
out this number. Devise in detail a strategy that achieves the minimum 
possible average number of questions. 

> Exercise 15.4. ^ How can you use a coin to draw straws among 3 people? 

> Exercise 15.5. ^ In a magic trick, there are three participants: the magician, 

an assistant, and a volunteer. The assistant, who claims to have paranor- 
mal abilities, is in a soundproof room. The magician gives the volunteer 
six blank cards, five white and one blue. The volunteer writes a dif- 
ferent integer from 1 to 100 on each card, as the magician is watching. 
The volunteer keeps the blue card. The magician arranges the five white 
cards in some order and passes them to the assistant. The assistant then 
announces the number on the blue card. 

How does the trick work? 
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234 15 — Further Exercises on Information Theory 

> Exercise 15.6. How does this trick work? 

'Here's an ordinary pack of cards, shuffled into random order. 
Please choose five cards from the pack, any that you wish. 
Don't let me see their faces. No, don't give them to me: pass 
them to my assistant Esmerelda. She can look at them. 
'Now, Esmerelda, show me four of the cards. Hmm. . . nine 
of spades, six of clubs, four of hearts, ten of diamonds. The 
hidden card, then, must be the queen of spades!' 

The trick can be performed as described above for a pack of 52 cards. 
Use information theory to give an upper bound on the number of cards 
for which the trick can be performed. 

> Exercise 15.7. ^ Find a probability sequence p = (pi,p2,...) such that 

H(p) = oo. 

> Exercise 15.8. ^ Consider a discrete memoryless source with Ax = {», b, c, d} 

and V x = {1/2, 1/ '4, 1/8, 1/8}. There are 4 8 = 65 536 eight-letter words 
that can be formed from the four letters. Find the total number of such 
words that are in the typical set T^p (equation 4.29) where N = 8 and 
/3= 0.1. 

t> Exercise 15.9. ^ Consider the source As = {a,b,c,d,e}, Vs = 
{V3, V3, V9, V9, V9} and the channel whose transition probability matrix 
is 

10 
% 
10 1 
Vz 



Q 



(15.1) 



Note that the source alphabet has five symbols, but the channel alphabet 
Ax = Ay = {0, 1, 2, 3} has only four. Assume that the source produces 
symbols at exactly 3/4 the rate that the channel accepts channel sym- 
bols. For a given (tiny) e > 0, explain how you would design a system 
for communicating the source's output over the channel with an aver- 
age error probability per source symbol less than e. Be as explicit as 
possible. In particular, do not invoke Shannon's noisy-channel coding 
theorem. 

> Exercise 15.10. ^ Consider a binary symmetric channel and a code C = 
{0000, 00 11, 1100, 1111}; assume that the four codewords are used with 
probabilities {1/2, 1/8, 1/8, 1/4}. 

What is the decoding rule that minimizes the probability of decoding 
error? [The optimal decoding rule depends on the noise level / of the 
binary symmetric channel. Give the decoding rule for each range of 
values of /, for / between and 1/2.] 

• ' Exercise 15.11. ^ Find the capacity and optimal input distribution for the 
three-input, three-output channel whose transition probabilities are: 



Q 



1 

% 1/3 
1/3 % 



(15.2) 
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fl Exercise 15.12. t 5 ' p - 239 ^ The input to a channel Q is a word of 8 bits. The 
output is also a word of 8 bits. Each time it is used, the channel flips 
exactly one of the transmitted bits, but the receiver does not know which 
one. The other seven bits are received without error. All 8 bits are 
equally likely to be the one that is flipped. Derive the capacity of this 
channel. 

Show, by describing an explicit encoder and decoder that it is possible 
reliably (that is, with zero error probability) to communicate 5 bits per 
cycle over this channel. 

t> Exercise 15.13. ^ A channel with input x £ {a, b, c} and output y £ {r, s, t, u} 
has conditional probability matrix: 



Q 



1/2 o o 

1/2 1/2 

1/2 1/2 

V2 



< 



What is its capacity? 

> Exercise 15.14. ^ The ten-digit number on the cover of a book known as the 
ISBN incorporates an error-detecting code. The number consists of nine 
source digits X\,X2, • • • ,xg, satisfying x n £ {0,1,..., 9}, and a tenth 
check digit whose value is given by 



mod 11. 



0- 521-64298-1 

1- 010-00000-4 

Table 15.1. Some valid ISBNs. 
[The hyphens are included for 
legibility.] 



Here x w £ {0, 1, ... ,9, 10}. If x w = 10 then the tenth digit is shown 
using the roman numeral X. 

Show that a valid ISBN satisfies: 



f 10 

£ 

\n=\ 



mod 11 = 0. 



Imagine that an ISBN is communicated over an unreliable human chan- 
nel which sometimes modifies digits and sometimes reorders digits. 

Show that this code can be used to detect (but not correct) all errors in 
which any one of the ten digits is modified (for example, 1-010-00000-4 
-> 1-010-00080-4). 

Show that this code can be used to detect all errors in which any two ad- 
jacent digits are transposed (for example, 1-010-00000-4 -> 1-100-00000- 
4). 

What other transpositions of pairs of non-adjacent digits can be de- 
tected? 

If the tenth digit were defined to be 



Xw = ^2 nx n I mod 10, 



\n=l 



why would the code not work so well? (Discuss the detection of both 
modifications of single digits and transpositions of digits.) 
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Exercise 15. 15. A channel with input x and output y has transition proba- 
bility matrix: 



Q 



1-/ 


/ 








/ 


l-f 














1-5 


9 








5 


1-5 



Assuming an input distribution of the form 

' p p 1 — p 



2'2' 



\_-p\ 

2 /' 



write down the entropy of the output, H(Y), and the conditional entropy 
of the output given the input, H(Y\X). 

Show that the optimal input distribution is given by 

1 

p = 



1 + 2 -H2(g)+H 2 (f) ' 



where H 2 (f) = /log 2 } + (1 - /) log 2 ^yy. 

Write down the optimal input distribution and the capacity of the chan- 
nel in the case / = 1/2, g = 0, and comment on your answer. 

t> Exercise 15.16.^ What are the differences in the redundancies needed in an 
error-detecting code (which can reliably detect that a block of data has 
been corrupted) and an error-correcting code (which can detect and cor- 
rect errors)? 



b< 
c* 
d< 



a 

b 
c 
d 



Remember ^H 2 {p) = log 2 1^. 



Further tales from information theory 

The following exercises give you the chance to discover for yourself the answers 
to some more surprising results of information theory. 

Exercise 15.17.^ Communication of information from correlated sources. Imag- 
ine that we want to communicate data from two data sources and 
to a central location C via noise-free one-way communication channels (fig- 
ure 15.2a). The signals x^ A > and x^ are strongly dependent, so their joint 
information content is only a little greater than the marginal information con- 
tent of either of them. For example, C is a weather collator who wishes to 
receive a string of reports saying whether it is raining in Allerton (x^ A ') and 
whether it is raining in Bognor (x^). The joint probability of x^ and x^ 
might be 

P ( x {A)^ x (B)y X (A) 
1 

0.49 0.01 

0.01 0.49 (15.3) 

The weather collator would like to know N successive values of x^ and x^ 
exactly, but, since he has to pay for every bit of information he receives, he 
is interested in the possibility of avoiding buying N bits from source A and 
N bits from source B. Assuming that variables x^ and x^ are generated 
repeatedly from this distribution, can they be encoded at rates Ra and Rb in 
such a way that C can reconstruct all the variables, with the sum of information 
transmission rates on the two lines being less than two bits per cycle? 
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encode 
X (A) . t (A) 



(a) 



r(B) 



Ra 

encode 
Rb 



C 



t (B) 



Kb, 
H ( X ( A \XW) 

H{X^\X^) 



(b) 



H(X (A) \X (B) ) H(X (A) ) 



Ra 



The answer, which you should demonstrate, is indicated in figure 15.2. 
In the general case of two dependent sources X^ A > and X( B \ there exist 
codes for the two transmitters that can achieve reliable communication of 
both X^ and X^ B > to C, as long as: the information rate from X^ A >, 
Ra, exceeds H(X^ \ X^); the information rate from X^ B \ Rb, exceeds 
H(X^ \X^); and the total information rate Ra + Rb exceeds the joint 
entropy H{X^ A \ X^) (Slepian and Wolf, 1973). 

So in the case of x^ A ' and x^ B ' above, each transmitter must transmit at 
a rate greater than -02(0. 02) = 0.14 bits, and the total rate Ra + Rb must 
be greater than 1.14 bits, for example Ra = 0.6, Rb = 0.6. There exist codes 
that can achieve these rates. Your task is to figure out why this is so. 

Try to find an explicit solution in which one of the sources is sent as plain 
text, t( s ) = x.( B \ and the other is encoded. 

bxercise 15.18. [3] Multiple access channels. Consider a channel with two sets 
of inputs and one output - for example, a shared telephone line (figure 15.3a). 
A simple model system has two binary inputs x^ A ' and and a ternary 
output y equal to the arithmetic sum of the two inputs, that's 0, 1 or 2. There 
is no noise. Users A and B cannot communicate with each other, and they 
cannot hear the output of the channel. If the output is a 0, the receiver can 
be certain that both inputs were set to 0; and if the output is a 2, the receiver 
can be certain that both inputs were set to 1. But if the output is 1, then 
it could be that the input state was (0, 1) or (1,0). How should users A and 
B use this channel so that their messages can be deduced from the received 
signals? How fast can A and B communicate? 

Clearly the total information rate from A and B to the receiver cannot 
be two bits. On the other hand it is easy to achieve a total information rate 
Ra + Rb of one bit. Can reliable communication be achieved at rates (Ra, Rb) 
such that Ra + Rb > 1? 

The answer is indicated in figure 15.3. 

Some practical codes for multi-user channels are presented in Ratzer and 
MacKay (2003). 

Exercise 15.19.^ Broadcast channels. A broadcast channel consists of a single 
transmitter and two or more receivers. The properties of the channel are de- 
fined by a conditional distribution Q(y (We'll assume the channel 
is memoryless.) The task is to add an encoder and two decoders to enable 
reliable communication of a common message at rate Ro to both receivers, an 
individual message at rate Ra to receiver A, and an individual message at rate 
Rb to receiver B. The capacity region of the broadcast channel is the convex 
hull of the set of achievable rate triplets (Rq,Ra,Rb)- 

A simple benchmark for such a channel is given by time-sharing (time- 
division signaling). If the capacities of the two channels, considered separately, 



Figure 15.2. Communication of 
information from dependent 
sources, (a) x^ and x^ arc 
dependent sources (the 
dependence is represented by the 
dotted arrow) . Strings of values of 
each variable are encoded using 
codes of rate Ra and Rb into 
transmissions t' A ' and t^ B \ which 
arc communicated over noise-free 
channels to a receiver C . (b) The 
achievable rate region. Both 
strings can be conveyed without 
error even though Ra < H(X^) 
and R B < H(X^). 



X 

Figure 15.4. The broadcast 
channel, x is the channel input: 
y( A ) and j/ s ' are the outputs. 
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(a) 



(b) 



X (A) 
X {B) 



y- 



P{y\x( A \x^) 



X (A) 

1 

1 

1 2 



Rb 
1 

1/2 



(c) 




Achievable 



1/2 



Ra 



Figure 15.3. Multiple access 
channels, (a) A general multiple 
access channel with two 
transmitters and one receiver, (b) 
A binary multiple access channel 
with output equal to the sum of 
two inputs, (c) The achievable 
region. 



are and C^ B \ then by devoting a fraction cj>A of the transmission time 

to channel A and 4>b = ^ — 4>a to channel B, we can achieve (Rq, Ra, Rb) = 
(O^aC^^bC^). 

We can do better than this, however. As an analogy, imagine speaking 
simultaneously to an American and a Belarusian; you are fluent in American 
and in Belarusian, but neither of your two receivers understands the other's 
language. If each receiver can distinguish whether a word is in their own 
language or not, then an extra binary file can be conveyed to both recipients by 
using its bits to decide whether the next transmitted word should be from the 
American source text or from the Belarusian source text. Each recipient can 
concatenate the words that they understand in order to receive their personal 
message, and can also recover the binary string. 

An example of a broadcast channel consists of two binary symmetric chan- 
nels with a common input. The two halves of the channel have flip prob- 
abilities Ja and Jb- We'll assume that A has the better half-channel, i.e., 
Ia < Ib < V2. [A closely related channel is a 'degraded' broadcast channel, 
in which the conditional probabilities are such that the random variables have 
the structure of a Markov chain, 




C (A) Ra 

Figure 15.5. Rates achievable by 
simple timesharing. 



y 



(A) 



y 



(B) 



(15.4) 



R 



i.e., y( B ) is a further degraded version of y( A > .] In this special case, it turns 
out that whatever information is getting through to receiver B can also be 
recovered by receiver A. So there is no point distinguishing between Ro and 
Rb- the task is to find the capacity region for the rate pair (Ro,Ra), where 
Ro is the rate of information reaching both A and B, and Ra is the rate of 
the extra information reaching A. 

The following exercise is equivalent to this one, and a solution to it is 
illustrated in figure 15.8. 

Exercise 15.20. Variable-rate error-correcting codes for channels with unknown 
noise level. In real life, channels may sometimes not be well characterized 
before the encoder is installed. As a model of this situation, imagine that a 
channel is known to be a binary symmetric channel with noise level either f a 
or fs- Let fs > /a, and let the two capacities be Ca and Cb- 

Those who like to live dangerously might install a system designed for noise 
level j 'a with rate Ra — Ca', in the event that the noise level turns out to be 
fs, our experience of Shannon's theories would lead us to expect that there 



Cn 



Figure 15.6. Rate of reliable 
communication R, as a function of 
noise level /, for Shannonesque 
codes designed to operate at noise 
levels /a (solid line) and /b 
(dashed line). 
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would be a catastrophic failure to communicate information reliably (solid line 
in figure 15.6). 

A conservative approach would design the encoding system for the worst- 
case scenario, installing a code with rate Rb — Cb (dashed line in figure 15.6). 
In the event that the lower noise level, Ja, holds true, the managers would 
have a feeling of regret because of the wasted capacity difference C 'a — Rb- 

Is it possible to create a system that not only transmits reliably at some 
rate Rq whatever the noise level, but also communicates some extra, 'lower- 
priority' bits if the noise level is low, as shown in figure 15.7? This code 
communicates the high-priority bits reliably at all noise levels between j a and 
fs, and communicates the low-priority bits also if the noise level is f a or 
below. 

This problem is mathematically equivalent to the previous problem, the 
degraded broadcast channel. The lower rate of communication was there called 
-Rrji and the ra te at which the low-priority bits are communicated if the noise 
level is low was called Ra- 

An illustrative answer is shown in figure 15.8, for the case f a = 0.01 and 
Jb = 0.1. (This figure also shows the achievable region for a broadcast channel 
whose two half-channels have noise levels Ja = 0.01 and fs = 0.1.) I admit I 
find the gap between the simple time-sharing solution and the cunning solution 
disappointingly small. 

In Chapter 50 we will discuss codes for a special class of broadcast channels, 
namely erasure channels, where every symbol is either received without error 
or erased. These codes have the nice property that they are rateless - the 
number of symbols transmitted is determined on the fly such that reliable 
comunication is achieved, whatever the erasure statistics of the channel. 

Exercise 15.21. ^ Multiterminal information networks are both important practi- 
cally and intriguing theoretically. Consider the following example of a two-way 
binary channel (figure 15.9a,b): two people both wish to talk over the channel, 
and they both want to hear what the other person is saying; but you can only 
hear the signal transmitted by the other person if you are transmitting a zero. 
What simultaneous information rates from A to B and from B to A can be 
achieved, and how? Everyday examples of such networks include the VHF 
channels used by ships, and computer ethernet networks (in which all the 
devices are unable to hear anything if two or more devices are broadcasting 
simultaneously) . 

Obviously, we can achieve rates of V2 in both directions by simple time- 
sharing. But can the two information rates be made larger? Finding the 
capacity of a general two-way channel is still an open problem. However, 
we can obtain interesting results concerning achievable points for the simple 
binary channel discussed above, as indicated in figure 15.9c. There exist codes 
that can achieve rates up to the boundary shown. There may exist better 
codes too. 

Solutions 

Solution to exercise 15.12 (p. 235). C(Q) = 5 bits. 

Hint for the last part: a solution exists that involves a simple (8,5) code. 
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R 




f A f B f 

Figure 15.7. Rate of reliable 
communication R, as a function of 
noise level /, for a desired 
variable-rate code. 



0.6 -| 
0.4 - 
0.2 - 

- 

0.2 0.4 0.6 0.8 1 

Figure 15.8. An achievable region 
for the channel with unknown 
noise level. Assuming the two 
possible noise levels arc /a = 0.01 
and fs = 0.1, the dashed lines 
show the rates Ra, Rb that are 
achievable using a simple 
time-sharing approach, and the 
solid line shows rates achievable 
using a more cunning approach. 
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X {A) 
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Figure 15.9. (a) A general 
two-way channel, (b) The rules 
for a binary two-way channel. 
The two tables show the outputs 
j/ A ' and j/( B ) that result for each 
state of the inputs, (c) Achievable 
region for the two-way binary 
channel. Rates below the solid 
line are achievable. The dotted 
line shows the 'obviously 
achievable' region which can be 
attained by simple time-sharing. 



(c) 
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Message Passing 



One of the themes of this book is the idea of doing complicated calculations 
using simple distributed hardware. It turns out that quite a few interesting 
problems can be solved by message-passing algorithms, in which simple mes- 
sages are passed locally among simple processors whose operations lead, after 
some time, to the solution of a global problem. 

► 16.1 Counting 

As an example, consider a line of soldiers walking in the mist. The commander 
wishes to perform the complex calculation of counting the number of soldiers 
in the line. This problem could be solved in two ways. 

First there is a solution that uses expensive hardware: the loud booming 
voices of the commander and his men. The commander could shout 'all soldiers 
report back to me within one minute!', then he could listen carefully as the 
men respond 'Molesworth here sir!', 'Fotherington-Thomas here sir!', and so 
on. This solution relies on several expensive pieces of hardware: there must be 
a reliable communication channel to and from every soldier; the commander 
must be able to listen to all the incoming messages - even when there are 
hundreds of soldiers - and must be able to count; and all the soldiers must be 
well-fed if they are to be able to shout back across the possibly-large distance 
separating them from the commander. 

The second way of finding this global function, the number of soldiers, 
does not require global communication hardware, high IQ, or good food; we 
simply require that each soldier can communicate single integers with the two 
adjacent soldiers in the line, and that the soldiers are capable of adding one 
to a number. Each soldier follows these rules: 



1. If you are the front soldier in the line, say the number 'one' to the 
soldier behind you. 

2. If you are the rearmost soldier in the line, say the number 'one' to 
the soldier in front of you. 

3. If a soldier ahead of or behind you says a number to you, add one 
to it, and say the new number to the soldier on the other side. 



If the clever commander can not only add one to a number, but also add 
two numbers together, then he can find the global number of soldiers by simply 
adding together: 



Algorithm 16.1. Message-passing 
rule-set A. 
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the number said to him by the 


(which equals the total number of 


soldier in front of him, 


soldiers in front) 


+ the number said to the com- 


(which is the number behind) 


mander by the soldier behind 




him, 




+ one 


(to count the commander himself). 



This solution requires only local communication hardware and simple compu- 
tations (storage and addition of integers). 



Separation 




2 1 

Commander 



Figure 16.2. A line of soldiers 
counting themselves using 
message-passing rule-set A. The 
commander can add '3' from the 
soldier in front, '1' from the 
soldier behind, and T for himself, 
and deduce that there are 5 
soldiers in total. 



This clever trick makes use of a profound property of the total number of 
soldiers: that it can be written as the sum of the number of soldiers in front 
of a point and the number behind that point, two quantities which can be 
computed separately, because the two groups are separated by the commander. 

If the soldiers were not arranged in a line but were travelling in a swarm, 
then it would not be easy to separate them into two groups in this way. The 




Figure 16.3. A swarm of guerillas. 



guerillas in figure 16.3 could not be counted using the above message-passing 
rule-set A, because, while the guerillas do have neighbours (shown by lines), 
it is not clear who is 'in front' and who is 'behind'; furthermore, since the 
graph of connections between the guerillas contains cycles, it is not possible 
for a guerilla in a cycle (such as 'Jim') to separate the group into two groups, 
'those in front', and 'those behind'. 

A swarm of guerillas can be counted by a modified message-passing algo- 
rithm if they are arranged in a graph that contains no cycles. 

Rule-set B is a message-passing algorithm for counting a swarm of guerillas 
whose connections form a cycle-free graph, also known as a tree, as illustrated 
in figure 16.4. Any guerilla can deduce the total in the tree from the messages 
that they receive. 
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Jim 




A 



Figure 16.4. A swarm of guerillas 
whose connections form a tree. 



1 



Count your number of neighbours, N. 



Algorithm 16.5. Message-passing 
rule-set B. 



2. Keep count of the number of messages you have received from your 
neighbours, m, and of the values v\, t>2, ...,vn of each of those 
messages. Let V be the running total of the messages you have 
received. 

3. If the number of messages you have received, m, is equal to N — 1, 
then identify the neighbour who has not sent you a message and tell 
them the number V + 1. 

4. If the number of messages you have received is equal to N, then: 

(a) the number V + 1 is the required total. 

(b) for each neighbour n { 



say to neighbour n the number V + 1 — u 



} 



A 



Figure 16.6. A triangular 41 x 41 
grid. How many paths are there 
from A to B? One path is shown. 




B 
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► 16.2 Path-counting 

A more profound task than counting squaddies is the task of counting the 
number of paths through a grid, and finding how many paths pass through 
any given point in the grid. 

Figure 16.6 shows a rectangular grid, and a path through the grid, con- 
necting points A and B. A valid path is one that starts from A and proceeds 
to B by rightward and downward moves. Our questions are: 

1. How many such paths are there from A to B? 

2. If a random path from A to B is selected, what is the probability that it 
passes through a particular node in the grid? [When we say 'random', 
we mean that all paths have exactly the the same probability of being 
selected.] 

3. How can a random path from A to B be selected? 

Counting all the paths from A to B doesn't seem straightforward. The number 
of paths is expected to be pretty big - even if the permitted grid were a diagonal 
strip only three nodes wide, there would still be about 2 N I 2 possible paths. 

The computational breakthrough is to realize that to find the number of 
paths, we do not have to enumerate all the paths explicitly. Pick a point P in 
the grid and consider the number of paths from A to P. Every path from A 
to P must come in to P through one of its upstream neighbours ('upstream' 
meaning above or to the left). So the number of paths from A to P can be 
found by adding up the number of paths from A to each of those neighbours. 

This message-passing algorithm is illustrated in figure 16.8 for a simple 
grid with ten vertices connected by twelve directed edges. We start by send- 
ing the '1' message from A. When any node has received messages from all its 
upstream neighbours, it sends the sum of them on to its downstream neigh- 
bours. At B, the number 5 emerges: we have counted the number of paths 
from A to B without enumerating them all. As a sanity-check, figure 16.9 
shows the five distinct paths from A to B. 

Having counted all paths, we can now move on to more challenging prob- 
lems: computing the probability that a random path goes through a given 
vertex, and creating a random path. 

Probability of passing through a node 

By making a backward pass as well as the forward pass, we can deduce how 
many of the paths go through each node; and if we divide that by the total 
number of paths, we obtain the probability that a randomly selected path 
passes through that node. Figure 16.10 shows the backward-passing messages 
in the lower-right corners of the tables, and the original forward-passing mes- 
sages in the upper-left corners. By multiplying these two numbers at a given 
vertex, we find the total number of paths passing through that vertex. For 
example, four paths pass through the central vertex. 

Figure 16.11 shows the result of this computation for the triangular 41 x 
41 grid. The area of each blob is proportional to the probability of passing 
through the corresponding node. 

Random path sampling 

Exercise 16.1. [1 ' p ' 2471 If one creates a 'random' path from A to B by nipping 
a fair coin at every junction where there is a choice of two directions, is 
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A. 

N 




'b 

Figure 16.7. Every path from A to 
P enters P through an upstream 
neighbour of P, cither M or N; so 
we can find the number of paths 
from A to P by adding the 
number of paths from A to M to 
the number from A to N. 
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Figure 16.8. Messages sent in the 
forward pass. 

A . 1 



B 

Figure 16.9. The five paths. 




Figure 16.10. Messages sent in the 
forward and backward passes. 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 



16.3: Finding the lowest-cost path 



245 



the resulting path a uniform random sample from the set of all paths? 
[Hint: imagine trying it for the grid of figure 16.8.] 

There is a neat insight to be had here, and I'd like you to have the satisfaction 
of figuring it out. 

Exercise 16.2. ^' P- 24 ^l Having run the forward and backward algorithms be- 
tween points A and B on a grid, how can one draw one path from A to 
B uniformly at random? (Figure 16.11.) 



(a) 



(b) 




Figure 16.11. (a) The probability 
of passing through each node, and 
(b) a randomly chosen path. 



The message-passing algorithm we used to count the paths to B is an 
example of the sum-product algorithm. The 'sum' takes place at each node 
when it adds together the messages coming from its predecessors; the 'product' 
was not mentioned, but you can think of the sum as a weighted sum in which 
all the summed terms happened to have weight 1. 



► 16.3 Finding the lowest-cost path 

Imagine you wish to travel as quickly as possible from Ambridge (A) to Bognor 
(B). The various possible routes are shown in figure 16.12, along with the cost 
in hours of traversing each edge in the graph. For example, the route A-I— L— 
N-B has a cost of 8 hours. We would like to find the lowest-cost path without 
explicitly evaluating the cost of all paths. We can do this efficiently by finding 
for each node what the cost of the lowest-cost path to that node from A is. 
These quantities can be computed by message-passing, starting from node A. 
The message-passing algorithm is called the min-sum algorithm or Viterbi 
algorithm. 

For brevity, we'll call the cost of the lowest-cost path from node A to 
node x 'the cost of x\ Each node can broadcast its cost to its descendants 
once it knows the costs of all its possible predecessors. Let's step through the 
algorithm by hand. The cost of A is zero. We pass this news on to H and I. 
As the message passes along each edge in the graph, the cost of that edge is 
added. We find the costs of H and I are 4 and 1 respectively (figure 16.13a). 
Similarly then, the costs of J and L are found to be 6 and 2 respectively, but 
what about K? Out of the edge H-K comes the message that a path of cost 5 
exists from A to K via H; and from edge I-K we learn of an alternative path 
of cost 3 (figure 16.13b). The min-sum algorithm sets the cost of K equal 
to the minimum of these (the 'min'), and records which was the smallest-cost 
route into K by retaining only the edge I-K and pruning away the other edges 
leading to K (figure 16.13c). Figures 16.13d and e show the remaining two 
iterations of the algorithm which reveal that there is a path from A to B with 
cost 6. [If the min-sum algorithm encounters a tie, where the minimum-cost 



K 



L^3 



N^3 



Figure 16.12. Route diagram from 
Ambridge to Bognor, showing the 
costs associated with the edges. 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 



246 



16 — Message Passing 



path to a node is achieved by more than one route to it, then the algorithm 
can pick any of those routes at random.] 

We can recover this lowest-cost path by backtracking from B, following 
the trail of surviving edges back to A. We deduce that the lowest-cost path is 
A-I-K-M-B. 

Other applications of the min-sum algorithm 

Imagine that you manage the production of a product from raw materials 
via a large set of operations. You wish to identify the critical path in your 
process, that is, the subset of operations that are holding up production. If 
any operations on the critical path were carried out a little faster then the 
time to get from raw materials to product would be reduced. 

The critical path of a set of operations can be found using the min-sum 
algorithm. 

In Chapter 25 the min-sum algorithm will be used in the decoding of 
error-correcting codes. 

► 16.4 Summary and related ideas 

Some global functions have a separability property. For example, the number 
of paths from A to P separates into the sum of the number of paths from A to M 
(the point to P's left) and the number of paths from A to N (the point above 
P). Such functions can be computed efficiently by message-passing. Other 
functions do not have such separability properties, for example 

1. the number of pairs of soldiers in a troop who share the same birthday; 

2. the size of the largest group of soldiers who share a common height 
(rounded to the nearest centimetre); 

3. the length of the shortest tour that a travelling salesman could take that 
visits every soldier in a troop. 

One of the challenges of machine learning is to find low-cost solutions to prob- 
lems like these. The problem of finding a large subset variables that are ap- 
proximately equal can be solved with a neural network approach (Hopfield and 
Brody, 2000; Hopfield and Brody, 2001). A neural approach to the travelling 
salesman problem will be discussed in section 42.9. 

► 16.5 Further exercises 

> Exercise 16.3. ^ Describe the asymptotic properties of the probabilities de- 

picted in figure 16.11a, for a grid in a triangle of width and height N. 

> Exercise 16.4.^ In image processing, the integral image I(x, y) obtained from 

an image f(x,y) (where x and y are pixel coordinates) is defined by 



x y 

EE 

u=0 v=0 



f(u,v). 



(16.1) 



Show that the integral image I(x, y) can be efficiently computed by mes- 
sage passing. 

Show that, from the integral image, some simple functions of the image 
can be obtained. For example, give an expression for the sum of the 
image intensities f(x, y) for all {x, y) in a rectangular region extending 
from {x x ,yi) to (x 2 ,y 2 ). 
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Figure 16.13. Min-sum 
message-passing algorithm to find 
the cost of getting to each node, 
and thence the lowest cost route 
from A to B. 
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16.6: Solutions 

► 16.6 Solutions 

Solution to exercise 16.1 (p. 244). Since there are five paths through the grid 
of figure 16.8, they must all have probability 1/5. But a strategy based on fair 
coin-flips will produce paths whose probabilities are powers of 1/2. 

Solution to exercise 16.2 (p. 245). To make a uniform random walk, each for- 
ward step of the walk should be chosen using a different biased coin at each 
junction, with the biases chosen in proportion to the backward messages ema- 
nating from the two options. For example, at the first choice after leaving A, 
there is a '3' message coming from the East, and a '2' coming from South, so 
one should go East with probability 3/5 and South with probability 2/5. This 
is how the path in figure 16.11 was generated. 
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17 



Communication over Constrained 
Noiseless Channels 



In this chapter we study the task of communicating efficiently over a con- 
strained noiseless channel - a constrained channel over which not all strings 
from the input alphabet may be transmitted. 

We make use of the idea introduced in Chapter 16, that global properties 
of graphs can be computed by a local message-passing algorithm. 



► 17.1 Three examples of constrained binary channels 

A constrained channel can be defined by rules that define which strings are 
permitted. 

Example 17.1. In Channel A every 1 must be followed by at least one 0. 
A valid string for this channel is 



00100101001010100010. 



(17.1) 



Channel A: 

the substring 11 is forbidden. 



As a motivation for this model, consider a channel in which Is are repre- 
sented by pulses of electromagnetic energy, and the device that produces 
those pulses requires a recovery time of one clock cycle after generating 
a pulse before it can generate another. 

Example 17.2. Channel B has the rule that all Is must come in groups of two 
or more, and all 0s must come in groups of two or more. 

A valid string for this channel is 



00111001110011000011. 



(17.2) 



Channel B: 

101 and 010 arc forbidden. 



As a motivation for this model, consider a disk drive in which succes- 
sive bits are written onto neighbouring points in a track along the disk 
surface; the values and 1 are represented by two opposite magnetic 
orientations. The strings 101 and 010 are forbidden because a single 
isolated magnetic domain surrounded by domains having the opposite 
orientation is unstable, so that 101 might turn into 111, for example. 

Example 17.3. Channel C has the rule that the largest permitted runlength is 
two, that is, each symbol can be repeated at most once. 

A valid string for this channel is 



Channel C: 

111 and 000 arc forbidden. 



10010011011001101001. 



(17.3) 
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17.1: Three examples of constrained binary channels 



A physical motivation for this model is a disk drive in which the rate of 
rotation of the disk is not known accurately, so it is difficult to distinguish 
between a string of two Is and a string of three Is, which are represented 
by oriented magnetizations of duration 2t and 3t respectively, where 
r is the (poorly known) time taken for one bit to pass by; to avoid 
the possibility of confusion, and the resulting loss of synchronization of 
sender and receiver, we forbid the string of three Is and the string of 
three 0s. 

All three of these channels are examples of runlength-limited channels. 
The rules constrain the minimum and maximum numbers of successive Is and 
0s. 



Channel 


Runlength of Is 


Runlength of 0s 




minimum 


maximum 


minimum 


maximum 


unconstrained 


1 


oo 


1 


oo 


A 


1 


1 


1 


oo 


B 


2 


oo 


2 


oo 


C 


1 


2 


1 


2 



In channel A, runs of 0s may be of any length but runs of Is are restricted to 
length one. In channel B all runs must be of length two or more. In channel 
C, all runs must be of length one or two. 

The capacity of the unconstrained binary channel is one bit per channel 
use. What are the capacities of the three constrained channels? [To be fair, 
we haven't defined the 'capacity' of such channels yet; please understand 'ca- 
pacity' as meaning how many bits can be conveyed reliably per channel-use.] 



Some codes for a constrained channel 

Let us concentrate for a moment on channel A, in which runs of 0s may be 
of any length but runs of Is are restricted to length one. We would like to 
communicate a random binary file over this channel as efficiently as possible. 

A simple starting point is a (2, 1) code that maps each source bit into two 
transmitted bits, C\. This is a rate-V2 code, and it respects the constraints of 
channel A, so the capacity of channel A is at least 0.5. Can we do better? 

C\ is redundant because if the first of two received bits is a zero, we know 00 

that the second bit will also be a zero. We can achieve a smaller average L ^ 

transmitted length using a code that omits the redundant zeroes in C\. 

C'2 is such a variable-length code. If the source symbols are used with Code C2 

equal frequency then the average transmitted length per source bit is ' 



so the average communication rate is 

R = 2 lz, (17.5) 

and the capacity of channel A must be at least 2 /3. 

Can we do better than C2? There are two ways to argue that the infor- 
mation rate could be increased above R = %. 

The first argument assumes we are comfortable with the entropy as a 
measure of information content. The idea is that, starting from code C'2, we 
can reduce the average message length, without greatly reducing the entropy 



Code Ci 



t 



L=-l + -2=-, (17.4) 

2 2 2' V ' 



1 10 
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of the message we send, by decreasing the fraction of Is that we transmit. 
Imagine feeding into C 2 a stream of bits in which the frequency of Is is /. [Such 
a stream could be obtained from an arbitrary binary file by passing the source 
file into the decoder of an arithmetic code that is optimal for compressing 
binary strings of density /.] The information rate R achieved is the entropy 
of the source, H 2 (f), divided by the mean transmitted length, 



Hf) = (l-f) + 2/ = ! + /. 



Thus 



R(f) 



H 2 (f) H 2 (f) 



(17.6) 



(17.7) 



Hf) 1 + / ' 

The original code C2, without preprocessor, corresponds to / = V2. What 
happens if we perturb / a little towards smaller /, setting 



(17.; 



for small negative 51 In the vicinity of / = V2, the denominator L(f) varies 
linearly with 5. In contrast, the numerator H 2 (f) only has a second-order 
dependence on 5. 

> Exercise 17.4. ^ Find, to order 5 2 , the Taylor expansion of H 2 (f) as a function 

of S. 

To first order, R(f) increases linearly with decreasing 5. It must be possible 
to increase R by decreasing /. Figure 17.1 shows these functions; R(f) does 
indeed increase as / decreases and has a maximum of about 0.69 bits per 
channel use at / ~ 0.38. 

By this argument we have shown that the capacity of channel A is at least 
maxj R(f) = 0.69. 

> Exercise 17.5.^' p - 257 l If a file containing a fraction / = 0.5 Is is transmitted 

by C 2 , what fraction of the transmitted stream is Is? 

What fraction of the transmitted bits is Is if we drive code C 2 with a 
sparse source of density / = 0.38? 

A second, more fundamental approach counts how many valid sequences 
of length N there are, Sjy. We can communicate log Sat bits in N channel 
cycles by giving one name to each of these valid sequences. 




Figure 17.1. Top: The information 
content per source symbol and 
mean transmitted length per 
source symbol as a function of the 
source density. Bottom: The 
information content per 
transmitted symbol, in bits, as a 
function of /. 



► 17.2 The capacity of a constrained noiseless channel 

We defined the capacity of a noisy channel in terms of the mutual information 
between its input and its output, then we proved that this number, the capac- 
ity, was related to the number of distinguishable messages S(N) that could be 
reliably conveyed over the channel in N uses of the channel by 



C= lim — log S(N). 



(17.9) 



In the case of the constrained noiseless channel, we can adopt this identity as 
our definition of the channel's capacity. However, the name s, which, when 
we were making codes for noisy channels (section 9.6), ran over messages 
s = 1, . . . , S, is about to take on a new role: labelling the states of our channel; 
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Figure 17.2. (a) State diagram for 
channel A. (b) Trellis section, (c) 
Trellis, (d) Connection matrix. 




Sn+l 



B 
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1 
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11 




c 




10 
11 
110 
10 



Figure 17.3. State diagrams, trellis 
sections and connection matrices 
for channels B and C. 



so in this chapter we will denote the number of distinguishable messages of 
length N by M/v, and define the capacity to be: 



C= lim —log Mat. 



(17.10) 



Once we have figured out the capacity of a channel we will return to the 
task of making a practical code for that channel. 



► 17.3 Counting the number of possible messages 

First let us introduce some representations of constrained channels. In a state 
diagram, states of the transmitter are represented by circles labelled with the 
name of the state. Directed edges from one state to another indicate that 
the transmitter is permitted to move from the first state to the second, and a 
label on that edge indicates the symbol emitted when that transition is made. 
Figure 17.2a shows the state diagram for channel A. It has two states, and 
1. When transitions to state are made, a is transmitted; when transitions 
to state 1 are made, a 1 is transmitted; transitions from state 1 to state 1 are 
not possible. 

We can also represent the state diagram by a trellis section, which shows 
two successive states in time at two successive horizontal locations (fig- 
ure 17.2b). The state of the transmitter at time n is called s n . The set of 
possible state sequences can be represented by a trellis as shown in figure 17.2c. 
A valid sequence corresponds to a path through the trellis, and the number of 
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Mi =2 
1 



1 



Mi = 2 M 2 = 3 




-®- »® ® 

1 2 



Mi = 2 M 2 = 3 M 3 = 5 




Figure 17.4. Counting the number 
of paths in the trellis of channel 
A. The counts next to the nodes 
arc accumulated by passing from 
left to right across the trellises. 




Mi = 2 M 2 =3 M 3 = 5 M 4 = 8 M 5 = 13 M 6 = 21 M 7 = 34 M 8 = 55 
(b) Channel B 1 2 3 4 6 10 17 




1 1 1 2 4 7 11 17 




Figure 17.5. Counting the number of paths in the trellises of channels A, B, and C. We assume that at 
the start the first bit is preceded by 00, so that for channels A and B, any initial character 
is permitted, but for channel C, the first character must be a 1. 
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0.75 
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3.7 


0.74 


6 


21 


1.615 


4.4 


0.73 


7 
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1.619 


5.1 


0.73 


8 


55 


1.618 


5.8 


0.72 


9 


89 


1.618 


6.5 


0.72 


10 
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1.618 


7.2 


0.72 


11 
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1.618 
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0.71 
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377 


1.618 
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400 
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Figure 17.6. Counting the number 
of paths in the trellis of channel A. 



valid sequences is the number of paths. For the purpose of counting how many 
paths there are through the trellis, we can ignore the labels on the edges and 
summarize the trellis section by the connection matrix A, in which A ss i = 1 
if there is an edge from state s to s', and A ss > = otherwise (figure 17. 2d). 
Figure 17.3 shows the state diagrams, trellis sections and connection matrices 
for channels B and C. 

Let's count the number of paths for channel A by message-passing in its 
trellis. Figure 17.4 shows the first few steps of this counting process, and 
figure 17.5a shows the number of paths ending in each state after n steps for 
n = 1, . . . , 8. The total number of paths of length n, M n , is shown along the 
top. We recognize M n as the Fibonacci series. 

> Exercise 17.6. ^ -I Show that the ratio of successive terms in the Fibonacci series 
tends to the golden ratio, 

1 \/5 

7= + 2 =1.618. (17-11) 

Thus, to within a constant factor, scales as ~ j N as N — > oo, so the 
capacity of channel A is 

C = lim log 2 [constant • j N ] = log 2 7 = log 2 1.618 = 0.694. (17.12) 

How can we describe what we just did? The count of the number of paths 
is a vector c^"'; we can obtain c^™ +1 ' from c'™' using: 

c (™ +1 ) = AcW. (17.13) 

So 

c W=A Jv c(°), (17.14) 

where c' ' is the state count before any symbols are transmitted. In figure 17.5 
we assumed c^ ' = [0, 1] T , i.e., that either of the two symbols is permitted at 
the outset. The total number of paths is M n = ~^2 S cj 1 = • n. In the limit, 
c( w ) becomes dominated by the principal right-eigenvector of A. 

C W ^ cons tant • Afe^. (17.15) 
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Here, Ai is the principal eigenvalue of A. 

So to find the capacity of any constrained channel, all we need to do is find 
the principal eigenvalue, Ai, of its connection matrix. Then 



C = log 2 Ai. 



(17.16) 



► 17.4 Back to our model channels 



Comparing figure 17.5a and figures 17.5b and c it looks as if channels B and 
C have the same capacity as channel A. The principal eigenvalues of the three 
trellises are the same (the eigenvectors for channels A and B are given at the 
bottom of table C.4, p. 608). And indeed the channels are intimately related. 

Equivalence of channels A and B 

If we take any valid string s for channel A and pass it through an accumulator, 
obtaining t defined by: 



z\ (Dj z 
J + 



1 

z\ (d] zo- 1 

J I 



Figure 17.7. An accumulator and 
a differentiator. 



t n -l 



s„ mod 2 for n > 2, 



(17.17) 



then the resulting string is a valid string for channel B, because there are no 
lis in s, so there are no isolated digits in t. The accumulator is an invertible 
operator, so, similarly, any valid string t for channel B can be mapped onto a 
valid string s for channel A through the binary differentiator, 



si = ti 

s n = t n — t n _i mod 2 for n > 2. 



(17.18) 



Because + and — are equivalent in modulo 2 arithmetic, the differentiator is 
also a blurrer, convolving the source stream with the filter (1, 1). 
Channel C is also intimately related to channels A and B. 

l> Exercise 17. 7 M' P- 257 l What is the relationship of channel C to channels A 
and B? 

► 17.5 Practical communication over constrained channels 

OK, how to do it in practice? Since all three channels are equivalent, we can 
concentrate on channel A. 



Fixed-length solutions 

We start with explicitly-enumerated codes. The code in the table 17.8 achieves 
a rate of % = 0.6. 



s 


c(s) 


1 


00000 


2 


10000 


3 


01000 


4 


00100 


5 


00010 


6 


10100 


7 


01010 


8 


10010 



Table 17.8. A runlcngth-limitcd 
code for channel A. 



l> Exercise 17.8. ^' p ' 257 l Similarly, enumerate all strings of length 8 that end in 
the zero state. (There are 34 of them.) Hence show that we can map 5 
bits (32 source strings) to 8 transmitted bits and achieve rate 5 /8 = 0.625. 

What rate can be achieved by mapping an integer number of source bits 
to N = 16 transmitted bits? 
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Optimal variable-length solution 

The optimal way to convey information over the constrained channel is to find 
the optimal transition probabilities for all points in the trellis, Q s i\ s , and make 
transitions with these probabilities. 

When discussing channel A, we showed that a sparse source with density 
/ = 0.38, driving code C2, would achieve capacity. And we know how to 
make sparsifiers (Chapter 6): we design an arithmetic code that is optimal 
for compressing a sparse source; then its associated decoder gives an optimal 
mapping from dense (i.e., random binary) strings to sparse strings. 

The task of finding the optimal probabilities is given as an exercise. 

Exercise 17.9. ^ Show that the optimal transition probabilities Q can be found 
as follows. 

Find the principal right- and left-eigenvectors of A, that is the solutions 
of Ae( fi ) = Xe^ and A = Ae^) with largest eigenvalue A. Then 
construct a matrix Q whose invariant distribution is proportional to 
e\^e^ L \ namely 

e {L) A , 

Qs'\s = ^A 1 - (17-19) 
\el 

[Hint: exercise 16.2 (p. 245) might give helpful cross-fertilization here.] 

> Exercise 17.10. [5 ' p ' 258] Show that when sequences are generated using the op- 
timal transition probability matrix (17.19), the entropy of the resulting 
sequence is asymptotically log 2 A per symbol. [Hint: consider the condi- 
tional entropy of just one symbol given the previous one, assuming the 
previous one's distribution is the invariant distribution.] 

In practice, we would probably use finite-precision approximations to the 
optimal variable-length solution. One might dislike variable-length solutions 
because of the resulting unpredictability of the actual encoded length in any 
particular case. Perhaps in some applications we would like a guarantee that 
the encoded length of a source file of size N bits will be less than a given 
length such as N/(C + e). For example, a disk drive is easier to control if 
all blocks of 512 bytes are known to take exactly the same amount of disk 
real-estate. For some constrained channels we can make a simple modification 
to our variable-length encoding and offer such a guarantee, as follows. We 
find two codes, two mappings of binary strings to variable-length encodings, 
having the property that for any source string x, if the encoding of x under 
the first code is shorter than average, then the encoding of x under the second 
code is longer than average, and vice versa. Then to transmit a string x we 
encode the whole string with both codes and send whichever encoding has the 
shortest length, prepended by a suitably encoded single bit to convey which 
of the two codes is being used. 

t> Exercise 17.11. ^ 3< ^' P- 258 l How many valid sequences of length 8 starting with 
a are there for the run-length-limited channels shown in figure 17.9? 

What are the capacities of these channels? 

Using a computer, find the matrices Q for generating a random path 
through the trellises of the channel A, and the two run-length- limited 
channels shown in figure 17.9. 






1 











1 


1 


1 


1 






1 — 1 














1 














1 


1 


1 


1 


1 



Figure 17.9. State diagrams and 
connection matrices for channels 
with maximum runlcngths for Is 
equal to 2 and 3. 
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> Exercise 17.12. [3 ' p ' 258] Consider the run-length-limited channel in which any 
length of run of 0s is permitted, and the maximum run length of Is is a 
large number L such as nine or ninety. 

Estimate the capacity of this channel. (Give the first two terms in a 
series expansion involving L.) 

What, roughly, is the form of the optimal matrix Q for generating a 
random path through the trellis of this channel? Focus on the values of 
the elements Qi|rj, the probability of generating a 1 given a preceding 0, 
and Ql\l-\, the probability of generating a 1 given a preceding run of 
L—l Is. Check your answer by explicit computation for the channel in 
which the maximum runlength of Is is nine. 

► 17.6 Variable symbol durations 

We can add a further frill to the task of communicating over constrained 
channels by assuming that the symbols we send have different durations, and 
that our aim is to communicate at the maximum possible rate per unit time. 
Such channels can come in two flavours: unconstrained, and constrained. 

Unconstrained channels with variable symbol durations 

We encountered an unconstrained noiseless channel with variable symbol du- 
rations in exercise 6.18 (p. 125). Solve that problem, and you've done this 
topic. The task is to determine the optimal frequencies with which the sym- 
bols should be used, given their durations. 

There is a nice analogy between this task and the task of designing an 
optimal symbol code (Chapter 4). When we make an binary symbol code 
for a source with unequal probabilities pi, the optimal message lengths are 
l i = lo S2 Vpi, so 

Pi = 2~ l t. (17.20) 

Similarly, when we have a channel whose symbols have durations li (in some 
units of time), the optimal probability with which those symbols should be 
used is 

p* = 2- l3l \ (17.21) 
where /3 is the capacity of the channel in bits per unit time. 

Constrained channels with variable symbol durations 

Once you have grasped the preceding topics in this chapter, you should be 
able to figure out how to define and find the capacity of these, the trickiest 
constrained channels. 

Exercise 17.13. ^ A classic example of a constrained channel with variable 
symbol durations is the 'Morse' channel, whose symbols are 



the dot d, 

the dash D, 

the short space (used between letters in morse code) s, and 

the long space (used between words) S; 



the constraints are that spaces may only be followed by dots and dashes. 

Find the capacity of this channel in bits per unit time assuming (a) that 
all four symbols have equal durations; or (b) that the symbol durations 
are 2, 4, 3 and 6 time units respectively. 
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Exercise 17.14. ^ How well-designed is Morse code for English (with, say, the 
probability distribution of figure 2.1)? 

Exercise 17.15. How difficult is it to get DNA into a narrow tube? 

To an information theorist, the entropy associated with a constrained 
channel reveals how much information can be conveyed over it. In sta- 
tistical physics, the same calculations are done for a different reason: to 
predict the thermodynamics of polymers, for example. 

As a toy example, consider a polymer of length N that can either sit 
in a constraining tube, of width L, or in the open where there are no 
constraints. In the open, the polymer adopts a state drawn at random 
from the set of one dimensional random walks, with, say, 3 possible 
directions per step. The entropy of this walk is log 3 per step, i.e., a 
total of A/dog 3. [The free energy of the polymer is defined to be — kT 
times this, where T is the temperature.] In the tube, the polymer's one- 
dimensional walk can go in 3 directions unless the wall is in the way, so 
the connection matrix is, for example (if L = 10), 
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Now, what is the entropy of the polymer? What is the change in entropy 
associated with the polymer entering the tube? If possible, obtain an 
expression as a function of L. Use a computer to find the entropy of the 
walk for a particular value of L, e.g. 20, and plot the probability density 
of the polymer's transverse location in the tube. 

Notice the difference in capacity between two channels, one constrained 
and one unconstrained, is directly proportional to the force required to 
pull the DNA into the tube. 



Figure 17.10. Model of DNA 
squashed in a narrow tube. The 
DNA will have a tendency to pop 
out of the tube, because, outside 
the tube, its random walk has 
greater entropy. 



► 17.7 Solutions 

Solution to exercise 17.5 (p. 250). A file transmitted by C2 contains, on aver- 
age, one-third Is and two-thirds 0s. 

If /= 0.38, the fraction of Is is f/(l + f) = (7 - 1.0)/(2 7 - 1.0) = 0.2764. 

Solution to exercise 17.7 (p. 254). A valid string for channel C can be obtained 
from a valid string for channel A by first inverting it [1 — > 0; — > 1], then 
passing it through an accumulator. These operations are invertible, so any 
valid string for C can also be mapped onto a valid string for A. The only 
proviso here comes from the edge effects. If we assume that the first character 
transmitted over channel C is preceded by a string of zeroes, so that the first 
character is forced to be a 1 (figure 17.5c) then the two channels are exactly 
equivalent only if we assume that channel A's first character must be a zero. 

Solution to exercise 17.8 (p. 254). With N = 16 transmitted bits, the largest 
integer number of source bits that can be encoded is 10, so the maximum rate 
of a fixed length code with N = 16 is 0.625. 
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Solution to exercise 17.10 (p. 255). Let the invariant distribution be 

P(a) = ae^ei R \ (17.22) 

where a is a normalization constant. The entropy of St given St—i, assuming Here, as in Chapter 4, St denotes 
St-i comes from the invariant distribution, is 

H(S t \S t -i) = -J2P(s)P(s'\s) log P(s'\s) 

s,s' 



the ensemble whose random 
variable is the state s t . 



e (L) A , 



Xe 



(L) 



log 



e (L) A , 



Xe. 



(17.23) 
(17.24) 



= -E< 



X 



- [log + log A s , s - log A - log e< L >] . (17.25) 



Now, A s i s is either or 1, so the contributions from the terms proportional to 
A s i s log A 8 i s are all zero. So 

H{S t \S t -i) = IogA + -H^^^. e ^e^IogcW + 



a 
A 



(17.26) 



log A - *W log ef + f £ Ae^e^ log e W (17.27) 



log A. 



(17.28) 



Solution to exercise 17.11 (p. 255). The principal eigenvalues of the connection 
matrices of the two channels are 1.839 and 1.928. The capacities (log A) are 
0.879 and 0.947 bits. 

Solution to exercise 17.12 (p. 256). The channel is similar to the unconstrained 
binary channel; runs of length greater than L are rare if L is large, so we only 
expect weak differences from this channel; these differences will show up in 
contexts where the run length is close to L. The capacity of the channel is 
very close to one bit. 

A lower bound on the capacity is obtained by considering the simple 
variable-length code for this channel which replaces occurrences of the maxi- 
mum runlength string 111. ..1 by 111.. .10, and otherwise leaves the source file 
unchanged. The average rate of this code is 1/(1 + 2 ~ L ) because the invariant 
distribution will hit the 'add an extra zero' state a fraction 2~ L of the time. 

We can reuse the solution for the variable-length channel in exercise 6.18 
(p. 125). The capacity is the value of (3 such that the equation 



L+l 

Z(/3) = ^2-^ = l 
i=i 



(17.29) 



is satisfied. The L+ 1 terms in the sum correspond to the L+l possible strings 
that can be emitted, 0, 10, 110, ... , 11. . .10. The sum is exactly given by: 



Z((3) 



-P 



(2^) 



L+l 



2-/5 



(17.30) 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 



17.7: Solutions 



259 



TV 



Here we 



used < 



a(r 



N+l 



1) 



n=0 



We anticipate that [3 should be a little less than 1 in order for Z{(3) to 
equal 1. Rearranging and solving approximately for [3, using ln(l + x) ~ x, 



2-( i+2 >/ln2. 



(17.31) 
(17.32) 



We evaluated the true capacities for L = 2 and L = 3 in an earlier exercise. 
The table compares the approximate capacity /3 with the true capacity for a 
selection of values of L. 

The element Q^ will be close to 1/2 (just a tiny bit larger), since in the 
unconstrained binary channel Qu = 1/2. When a run of length L — 1 has 
occurred, we effectively have a choice of printing 10 or 0. Let the probability of 
selecting 10 be /. Let us estimate the entropy of the remaining N characters 
in the stream as a function of /, assuming the rest of the matrix Q to have 
been set to its optimal value. The entropy of the next N characters in the 
stream is the entropy of the first bit, H 2 {f), plus the entropy of the remaining 
characters, which is roughly (N—l) bits if we select as the first bit and 
(N — 2) bits if 1 is selected. More precisely, if C is the capacity of the channel 
(which is roughly 1), 

H(ihe next N chars) ~ H 2 (f) + [(N - 1)(1 - /) + (N - 2)/] C 

= H 2 (f) + NC - fC ~ H 2 (f) + N - f. (17.33) 

Differentiating and setting to zero to find the optimal /, we obtain: 

W 1 I"/ o 



log 2 



/ 



1 



/ 



/ * 1/3. 



(17.34) 



The probability of emitting a 1 thus decreases from about 0.5 to about 1/3 as 
the number of emitted Is increases. 
Here is the optimal matrix: 



.3334 




















.4287 
.4669 





4841 













.4923 
.4963 













.4983 













.4993 




4998 



1 .6666 .5713 .5331 .5159 .5077 .5037 .5017 .5007 .5002 



(17.35) 



L 




True capacity 


2 


0.910 


0.879 


3 


0.955 


0.947 


4 


0.977 


0.975 


5 


0.9887 


0.9881 


6 


0.9944 


0.9942 


9 


0.9993 


0.9993 



Our rough theory works. 
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Crosswords and Codebreaking 



In this chapter we make a random walk through a few topics related to lan- 
guage modelling. 



► 18.1 Crosswords 

The rules of crossword-making may be thought of as defining a constrained 
channel. The fact that many valid crosswords can be made demonstrates that 
this constrained channel has a capacity greater than zero. 

There are two archetypal crossword formats. In a 'type A' (or American) 
crossword, every row and column consists of a succession of words of length 2 
or more separated by one or more spaces. In a 'type B' (or British) crossword, 
each row and column consists of a mixture of words and single characters, 
separated by one or more spaces, and every character lies in at least one word 
(horizontal or vertical). Whereas in a type A crossword every letter lies in a 
horizontal word and a vertical word, in a typical type B crossword only about 
half of the letters do so; the other half lie in one word only. 

Type A crosswords are harder to create than type B because of the con- 
straint that no single characters are permitted. Type B crosswords are gener- 
ally harder to solve because there are fewer constraints per character. 

Why are crosswords possible? 

If a language has no redundancy, then any letters written on a grid form a 
valid crossword. In a language with high redundancy, on the other hand, it 
is hard to make crosswords (except perhaps a small number of trivial ones) . 
The possibility of making crosswords in a language thus demonstrates a bound 
on the redundancy of that language. Crosswords are not normally written in 
genuine English. They are written in 'word-English', the language consisting 
of strings of words from a dictionary, separated by spaces. 

> Exercise 18. l.^ Estimate the capacity of word-English, in bits per character. 

[Hint: think of word-English as defining a constrained channel (Chapter 
17) and see exercise 6.18 (p. 125).] 

The fact that many crosswords can be made leads to a lower bound on the 
entropy of word-English. 

For simplicity, we now model word-English by Wenglish, the language in- 
troduced in section 4.1 which consists of W words all of length L. The entropy 
of such a language, per character, including inter-word spaces, is: 
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Figure 18.1. Crosswords of types 
A (American) and B (British). 
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log 2 W 
L + l ' 



(18.1) 
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We'll find that the conclusions we come to depend on the value of Hw and 
are not terribly sensitive to the value of L. Consider a large crossword of size 
S squares in area. Let the number of words be f w S and let the number of 
letter-occupied squares be f\S. For typical crosswords of types A and B made 
of words of length L, the two fractions f w and f\ have roughly the values in 
table 18.2. 

We now estimate how many crosswords there are of size S using our simple 
model of Wenglish. We assume that Wenglish is created at random by gener- 
ating W strings from a monogram (i.e., memoryless) source with entropy Hq. 
If, for example, the source used all A = 26 characters with equal probability 
then Hq = log 2 A = 4.7 bits. If instead we use Chapter 2's distribution then 
the entropy is 4.2. The redundancy of Wenglish stems from two sources: it 
tends to use some letters more than others; and there are only W words in 
the dictionary. 

Let's now count how many crosswords there are by imagining filling in 
the squares of a crossword at random using the same distribution that pro- 
duced the Wenglish dictionary and evaluating the probability that this random 
scribbling produces valid words in all rows and columns. The total number of 
typical fillings-in of the fiS squares in the crossword that can be made is 



The probability that one word of length L is validly filled-in is 

W 



2 LH ■ 



(18.2) 



(18.3) 



and the probability that the whole crossword, made of f w S words, is validly 
filled-in by a single typical in-filling is approximately 

(3 fwS . (18.4) 

So the log of the number of valid crosswords of size S is estimated to be 

log/^" s |T| = S[(h- f w L)H + f w logW} (18.5) 
= S[(h- f w L)H + f w (L + l)H w ], (18.6) 

which is an increasing function of S only if 

(fi-f w L)H o + f w (L + l)H w >0. (18.7) 

So arbitrarily many crosswords can be made only if there's enough words in 
the Wenglish dictionary that 



Hw > { { wL ~ f !^ H . 



f w (L + l) 

Plugging in the values of f\ and f w from table 18.2, we find the following 



(18.8) 



Crossword type 


A 


B 


Condition for crosswords 


H w > 1T+~T H 





If we set Hq = 4.2 bits and assume there are W = 4000 words in a normal 
English-speaker's dictionary, all with length L = 5, then we find that the 
condition for crosswords of type B is satisfied, but the condition for crosswords 
of type A is only just satisfied. This fits with my experience that crosswords 
of type A usually contain more obscure words. 
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L + l 


AL + 1 



Table 18.2. Factors /.„, and /i by 
which the number of words and 
number of letter-squares 
respectively are smaller than the 
total number of squares. 



This calculation underestimates 
the number of valid Wenglish 
crosswords by counting only 
crosswords filled with 'typical' 
strings. If the monogram 
distribution is non-uniform then 
the true count is dominated by 
'atypical' fillings-in, in which 
crossword-friendly words appear 
more often. 
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Further reading 

These observations about crosswords were first made by Shannon (1948); I 
learned about them from Wolf and Siegel (1998). The topic is closely related 
to the capacity of two-dimensional constrained channels. An example of a 
two-dimensional constrained channel is a two-dimensional bar-code, as seen 
on parcels. 

Exercise 18.2. ^ A two-dimensional channel is defined by the constraint that, 
of the eight neighbours of every interior pixel in an N x N rectangular 
grid, four must be black and four white. (The counts of black and white 
pixels around boundary pixels are not constrained.) A binary pattern 
satisfying this constraint is shown in figure 18.3. What is the capacity 
of this channel, in bits per pixel, for large N? 



Figure 18.3. A binary pattern in 
which every pixel is adjacent to 
four black and four white pixels. 



► 18.2 Simple language models 

The Zipf-Mandelbrot distribution 

The crudest model for a language is the monogram model, which asserts that 
each successive word is drawn independently from a distribution over words. 
What is the nature of this distribution over words? 

Zipf 's law (Zipf, 1949) asserts that the probability of the rth most probable 
word in a language is approximately 



P (r) = 



(18.9) 



where the exponent a has a value close to 1, and k is a constant. According 
to Zipf, a log-log plot of frequency versus word-rank should show a straight 
line with slope —a. 

Mandelbrot's (1982) modification of Zipf's law introduces a third param- 
eter v, asserting that the probabilities are given by 



P(r) 



(r + v) c 



(18.10) 



For some documents, such as Jane Austen's Emma, the Zipf-Mandelbrot dis- 
tribution fits well - figure 18.4. 

Other documents give distributions that are not so well fitted by a Zipf- 
Mandelbrot distribution. Figure 18.5 shows a plot of frequency versus rank for 
the MpjX source of this book. Qualitatively, the graph is similar to a straight 
line, but a curve is noticeable. To be fair, this source file is not written in 
pure English - it is a mix of English, maths symbols such as 'x\ and IATpjX 
commands. 




nf ormat ion 

bability 



Figure 18.4. Fit of the 

Zipf-Mandelbrot distribution 
(18.10) (curve) to the empirical 
frequencies of words in Jane 
Austen's Emma (dots). The fitted 
parameters are k = 0.56; v = 8.0; 
a = 1.26. 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 



18.2: Simple language models 



263 



0.1 
0.01 
0.001 
0.0001 
0.00001 



.the 

of 



alpha=1 



robability 

information 




hannon 
ayes 



100 1000 



alpha=10 



alpha=100 




- book 



10 100 1000 10000 



Figure 18.5. Log-log plot of 
frequency versus rank for the 
words in the I^TjTjX file of this 
book. 



Figure 18.6. Zipf plots for four 
'languages' randomly generated 
from Dirichlct processes with 
parameter a ranging from 1 to 
1000. Also shown is the Zipf plot 
for this book. 



The Dirichlet process 

Assuming we are interested in monogram models for languages, what model 
should we use? One difficulty in modelling a language is the unboundedness 
of vocabulary. The greater the sample of language, the greater the number 
of words encountered. A generative model for a language should emulate 
this property. If asked 'what is the next word in a newly-discovered work 
of Shakespeare?' our probability distribution over words must surely include 
some non-zero probability for words that Shakespeare never used before. Our 
generative monogram model for language should also satisfy a consistency 
rule called exchangeability. If we imagine generating a new language from 
our generative model, producing an ever-growing corpus of text, all statistical 
properties of the text should be homogeneous: the probability of finding a 
particular word at a given location in the stream of text should be the same 
everywhere in the stream. 

The Dirichlet process model is a model for a stream of symbols (which we 
think of as 'words') that satisfies the exchangeability rule and that allows the 
vocabulary of symbols to grow without limit. The model has one parameter 
a. As the stream of symbols is produced, we identify each new symbol by a 
unique integer w. When we have seen a stream of length F symbols, we define 
the probability of the next symbol in terms of the counts {F w } of the symbols 
seen so far thus: the probability that the next symbol is a new symbol, never 
seen before, is 

(18.11) 

F + a y ' 

The probability that the next symbol is symbol w is 



F 



(18.12) 



Figure 18.6 shows Zipf plots (i.e., plots of symbol frequency versus rank) for 
million-symbol 'documents' generated by Dirichlet process priors with values 
of a ranging from 1 to 1000. 

It is evident that a Dirichlet process is not an adequate model for observed 
distributions that roughly obey Zipf 's law. 
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Figure 18.7. Zipf plots for the 
words of two 'languages' 
generated by creating successive 
characters from a Dirichlet 
process with a = 2, and declaring 
one character to be the space 
character. The two curves result 
from two different choices of the 
space character. 



With a small tweak, however, Dirichlet processes can produce rather nice 
Zipf plots. Imagine generating a language composed of elementary symbols 
using a Dirichlet process with a rather small value of the parameter a, so that 
the number of reasonably frequent symbols is about 27. If we then declare 
one of those symbols (now called 'characters' rather than words) to be a space 
character, then we can identify the strings between the space characters as 
'words'. If we generate a language in this way then the frequencies of words 
often come out as very nice Zipf plots, as shown in figure 18.7. Which character 
is selected as the space character determines the slope of the Zipf plot - a less 
probable space character gives rise to a richer language with a shallower slope. 




► 18.3 Units of information content 

The information content of an outcome, x, whose probability is P(x), is defined 
to be 

h{x) = log T(xy (18J3) 

The entropy of an ensemble is an average information content, 

ff(X) = log (18.14) 

When we compare hypotheses with each other in the light of data, it is of- 
ten convenient to compare the log of the probability of the data under the 
alternative hypotheses, 

'log evidence for Hi' = logP(D| Hi), (18.15) 

or, in the case where just two hypotheses are being compared, we evaluate the 
'log odds', 

which has also been called the 'weight of evidence in favour of TL\. The 
log evidence for a hypothesis, log_P(D \ TLi) is the negative of the information 
content of the data D: if the data have large information content, given a hy- 
pothesis, then they are surprising to that hypothesis; if some other hypothesis 
is not so surprised by the data, then that hypothesis becomes more probable. 
'Information content', 'surprise value', and log likelihood or log evidence are 
the same thing. 

All these quantities are logarithms of probabilities, or weighted sums of 
logarithms of probabilities, so they can all be measured in the same units. 
The units depend on the choice of the base of the logarithm. 

The names that have been given to these units are shown in table 18.8. 
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Unit 


Expression that has those units 


bit 


log 2 p 


nat 


log e p 


ban 




deciban (db) 


101og 10 p 



Table 18.8. Units of measurement 
of information content. 



The bit is the unit that we use most in this book. Because the word 'bit' 
has other meanings, a backup name for this unit is the shannon. A byte is 
8 bits. A megabyte is 2 20 ~ 10 6 bytes. If one works in natural logarithms, 
information contents and weights of evidence are measured in nats. The most 
interesting units are the ban and the deciban. 

The history of the ban 

Let me tell you why a factor of ten in probability is called a ban. When Alan 
Turing and the other codebreakers at Bletchley Park were breaking each new 
day's Enigma code, their task was a huge inference problem: to infer, given 
the day's cyphertext, which three wheels were in the Enigma machines that 
day; what their starting positions were; what further letter substitutions were 
in use on the steckerboard; and, not least, what the original German messages 
were. These inferences were conducted using Bayesian methods (of course!), 
and the chosen units were decibans or half-decibans, the deciban being judged 
the smallest weight of evidence discernible to a human. The evidence in favour 
of particular hypotheses was tallied using sheets of paper that were specially 
printed in Banbury, a town about 30 miles from Bletchley. The inference task 
was known as Banburismus, and the units in which Banburismus was played 
were called bans, after that town. 

► 18.4 A taste of Banburismus 

The details of the code-breaking methods of Bletchley Park were kept secret 
for a long time, but some aspects of Banburismus can be pieced together. 
I hope the following description of a small part of Banburismus is not too 
inaccurate. 1 

How much information was needed? The number of possible settings of 
the Enigma machine was about 8 x 10 12 . To deduce the state of the machine, 
'it was therefore necessary to find about 129 decibans from somewhere', as 
Good puts it. Banburismus was aimed not at deducing the entire state of the 
machine, but only at figuring out which wheels were in use; the logic-based 
bombes, fed with guesses of the plaintext (cribs), were then used to crack what 
the settings of the wheels were. 

The Enigma machine, once its wheels and plugs were put in place, im- 
plemented a continually-changing permutation cypher that wandered deter- 
ministically through a state space of 26 3 permutations. Because an enormous 
number of messages were sent each day, there was a good chance that what- 
ever state one machine was in when sending one character of a message, there 
would be another machine in the same state while sending a particular char- 
acter in another message. Because the evolution of the machine's state was 
deterministic, the two machines would remain in the same state as each other 

1 I've been most helped by descriptions given by Tony Sale (http://www. 
codesandciphers.org.uk/lectures/) and by Jack Good (1979), who worked with Turing 
at Bletchley. 
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for the rest of the transmission. The resulting correlations between the out- 
puts of such pairs of machines provided a dribble of information-content from 
which Turing and his co-workers extracted their daily 129 decibans. 

How to detect that two messages came from machines with a common 
state sequence 

The hypotheses are the null hypothesis, Ho, which states that the machines 
are in different states, and that the two plain messages are unrelated; and the 
'match' hypothesis, Hi, which says that the machines are in the same state, 
and that the two plain messages are unrelated. No attempt is being made 
here to infer what the state of either machine is. The data provided are the 
two cyphertexts x and y; let's assume they both have length T and that the 
alphabet size is A (26 in Enigma). What is the probability of the data, given 
the two hypotheses? 

First, the null hypothesis. This hypothesis asserts that the two cyphertexts 
are given by 

x = xix 2 x 3 ... = ci(ui)c 2 (u 2 )c 3 {u 3 ) . . . (18.17) 



and 



y = 2/12/22/3 • • • = c'i(vi)c' 2 (v 2 )c' 3 (v 3 ) . . . , (18.18) 



where the codes Ct and c' t are two unrelated time- varying permutations of the 
alphabet, and uiu 2 u 3 . . . and vyv 2 v 3 . . . are the plaintext messages. An exact 
computation of the probability of the data (x, y) would depend on a language 
model of the plain text, and a model of the Enigma machine's guts, but if we 
assume that each Enigma machine is an ideal random time- varying permuta- 
tion, then the probability distribution of the two cyphertexts is uniform. All 
cyphertexts are equally likely. 

P(x, y I H ) = (^j for all x, y of length T. (18.19) 

What about H{1 This hypothesis asserts that a single time-varying permuta- 
tion ct underlies both 

x = xix 2 x 3 ... = ci(ui)c 2 (u 2 )c 3 (u 3 ) . . . (18.20) 

and 

y = 2/12/22/3 ■ • ■ = ci(vi)c 2 (v 2 )c 3 (v 3 ) .... (18.21) 

What is the probability of the data (x, y)? We have to make some assumptions 
about the plaintext language. If it were the case that the plaintext language 
was completely random, then the probability of uiu 2 u 3 . . . and viv 2 v 3 . . . would 
be uniform, and so would that of x and y, so the probability P(x,y \Hi) 
would be equal to P(x,y | Ho), and the two hypotheses Ho and Hi would be 
indistinguishable . 

We make progress by assuming that the plaintext is not completely ran- 
dom. Both plaintexts are written in a language, and that language has redun- 
dancies. Assume for example that particular plaintext letters are used more 
often than others. So, even though the two plaintext messages are unrelated, 
they are slightly more likely to use the same letters as each other; if Hi is true, 
two synchronized letters from the two cyphertexts are slightly more likely to 
be identical. Similarly, if a language uses particular bigrams and trigrams 
frequently, then the two plaintext messages will occasionally contain the same 
bigrams and trigrams at the same time as each other, giving rise, if Hi is true, 
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u LITTLE- JACK-HORNER-SAT-IN-THE-CORNER-EATING-A-CHRISTMAS-PIE — HE-PUT-IN-H 
v RIDE-A-C0CK-H0RSE-T0-BANBURY-CR0SS-T0-SEE-A-FINE-LADY-UP0N-A-WHITE-H0RSE 
matches: .*....*.. ****** .* * * * 



to a little burst of 2 or 3 identical letters. Table 18.9 shows such a coinci- 
dence in two plaintext messages that are unrelated, except that they are both 
written in English. 

The codebreakers hunted among pairs of messages for pairs that were sus- 
piciously similar to each other, counting up the numbers of matching mono- 
grams, bigrams, trigrams, etc. This method was first used by the Polish 
codebreaker Rejewski. 

Let's look at the simple case of a monogram language model and estimate 
how long a message is needed to be able to decide whether two machines 
are in the same state. I'll assume the source language is monogram-English, 
the language in which successive letters are drawn i.i.d. from the probability 
distribution {pi} of figure 2.1. The probability of x and y is nonuniform: 
consider two single characters, xt = ct{ut) and yt = Ct(vt); the probability 
that they are identical is 



Table 18.9. Two aligned pieces of 
English plaintext, u and v, with 
matches marked by *. Notice that 
there are twelve matches, 
including a run of six, whereas the 
expected number of matches in 
two completely random strings of 
length T = 74 would be about 3. 
The two corresponding 
cyphertexts from two machines in 
identical states would also have 
twelve matches. 



(18.22) 



We give this quantity the name m, for 'match probability'; for both English 
and German, m is about 2/26 rather than 1/26 (the value that would hold 
for a completely random language). Assuming that Ct is an ideal random 
permutation, the probability of xt and yt is, by symmetry, 



P(x u y t \Hi) = { (i- 



A(A-l) 



if x t = y t 
for x t ^y t - 



(18.23) 



Given a pair of cyphertexts x and y of length T that match in M places and 
do not match in N places, the log evidence in favour of Hi is then 



log 



P(x,y|Wo) 



(l-m) 

m/A A(A-i) 
Mlog — + Nlog y ' 



1/A 2 

Mlog mA + TV log 



1/A 2 
(1 - m)A 



(18.24) 
(18.25) 



A - I ' 

Every match contributes logmA in favour of Hi] every non-match contributes 
lo § (i A m)A in favour of Wo- 



Match probability for monogram- English 


m 


0.076 


Coincidental match probability 


1/A 


0.037 


log-evidence for Hi per match 


10 log 10 mA 


3.1 db 


log-evidence for Hi per non-match 




-0.18 db 



If there were M = 4 matches and N = 47 non-matches in a pair of length 
T = 51, for example, the weight of evidence in favour of Hi would be +4 
decibans, or a likelihood ratio of 2.5 to 1 in favour. 

The expected weight of evidence from a line of text of length T = 20 
characters is the expectation of (18.25), which depends on whether Hi or Ho 
is true. If Hi is true then matches are expected to turn up at rate m, and the 
expected weight of evidence is 1.4 decibans per 20 characters. If Hq is true 
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then spurious matches are expected to turn up at rate 1 /A, and the expected 
weight of evidence is —1.1 decibans per 20 characters. Typically, roughly 400 
characters need to be inspected in order to have a weight of evidence greater 
than a hundred to one (20 decibans) in favour of one hypothesis or the other. 

So, two English plaintexts have more matches than two random strings. 
Furthermore, because consecutive characters in English are not independent, 
the bigram and trigram statistics of English are nonuniform and the matches 
tend to occur in bursts of consecutive matches. [The same observations also 
apply to German.] Using better language models, the evidence contributed 
by runs of matches was more accurately computed. Such a scoring system 
was worked out by Turing and refined by Good. Positive results were passed 
on to automated and human-powered codebreakers. According to Good, the 
longest false-positive that arose in this work was a string of 8 consecutive 
matches between two machines that were actually in unrelated states. 

Further reading 

For further reading about Turing and Bletchley Park, see Hodges (1983) and 
Good (1979). For an in-depth read about cryptography, Schneier's (1996) 
book is highly recommended. It is readable, clear, and entertaining. 

► 18.5 Exercises 

t> Exercise 18.3. ^ Another weakness in the design of the Enigma machine, 
which was intended to emulate a perfectly random time- varying permu- 
tation, is that it never mapped a letter to itself. When you press Q, what 
comes out is always a different letter from Q. How much information per 
character is leaked by this design flaw? How long a crib would be needed 
to be confident that the crib is correctly aligned with the cyphertext? 
And how long a crib would be needed to be able confidently to identify 
the correct key? 

[A crib is a guess for what the plaintext was. Imagine that the Brits 
know that a very important German is travelling from Berlin to Aachen, 
and they intercept Enigma-encoded messages sent to Aachen. It is a 
good bet that one or more of the original plaintext messages contains 
the string OBERSTURMBANNFUEHRERXGRAFXHEINRICHXVONXWEIZSAECKER, 
the name of the important chap. A crib could be used in a brute-force 
approach to find the correct Enigma key (feed the received messages 
through all possible Engima machines and see if any of the putative 
decoded texts match the above plaintext). This question centres on the 
idea that the crib can also be used in a much less expensive manner: 
slide the plaintext crib along all the encoded messages until a perfect 
mismatch of the crib and the encoded message is found; if correct, this 
alignment then tells you a lot about the key] 
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Why have Sex? Information Acquisition 
and Evolution 

Evolution has been happening on earth for about the last 10 9 years. Un- 
deniably, information has been acquired during this process. Thanks to the 
tireless work of the Blind Watchmaker, some cells now carry within them all 
the information required to be outstanding spiders; other cells carry all the 
information required to make excellent octopuses. Where did this information 
come from? 

The entire blueprint of all organisms on the planet has emerged in a teach- 
ing process in which the teacher is natural selection: fitter individuals have 
more progeny, the fitness being defined by the local environment (including 
the other organisms). The teaching signal is only a few bits per individual: an 
individual simply has a smaller or larger number of grandchildren, depending 
on the individual's fitness. 'Fitness' is a broad term that could cover 

• the ability of an antelope to run faster than other antelopes and hence 
avoid being eaten by a lion; 

• the ability of a lion to be well-enough camouflaged and run fast enough 
to catch one antelope per day; 

• the ability of a peacock to attract a peahen to mate with it; 

• the ability of a peahen to rear many young simultaneously. 

The fitness of an organism is largely determined by its DNA - both the coding 
regions, or genes, and the non-coding regions (which play an important role 
in regulating the transcription of genes). We'll think of fitness as a function 
of the DNA sequence and the environment. 

How does the DNA determine fitness, and how does information get from 
natural selection into the genome? Well, if the gene that codes for one of an 
antelope's proteins is defective, that antelope might get eaten by a lion early 
in life and have only two grandchildren rather than forty. The information 
content of natural selection is fully contained in a specification of which off- 
spring survived to have children - an information content of at most one bit 
per offspring. The teaching signal does not communicate to the ecosystem 
any description of the imperfections in the organism that caused it to have 
fewer children. The bits of the teaching signal are highly redundant, because, 
throughout a species, unfit individuals who are similar to each other will be 
failing to have offspring for similar reasons. 

So, how many bits per generation are acquired by the species as a whole 
by natural selection? How many bits has natural selection succeeded in con- 
veying to the human branch of the tree of life, since the divergence between 
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Australopithecines and apes 4 000 000 years ago? Assuming a generation time 
of 10 years for reproduction, there have been about 400 000 generations of 
human precursors since the divergence from apes. Assuming a population of 
10 9 individuals, each receiving a couple of bits of information from natural 
selection, the total number of bits of information responsible for modifying 
the genomes of 4 million B.C. into today's human genome is about 8 x 10 14 
bits. However, as we noted, natural selection is not smart at collating the 
information that it dishes out to the population, and there is a great deal of 
redundancy in that information. If the population size were twice as great, 
would it evolve twice as fast? No, because natural selection will simply be 
correcting the same defects twice as often. 

John Maynard Smith has suggested that the rate of information acquisition 
by a species is independent of the population size, and is of order 1 bit per 
generation. This figure would allow for only 400 000 bits of difference between 
apes and humans, a number that is much smaller than the total size of the 
human genome - 6 x 10 9 bits. [One human genome contains about 3 x 10 9 
nucleotides.] It is certainly the case that the genomic overlap between apes 
and humans is huge, but is the difference that small? 

In this chapter, we'll develop a crude model of the process of information 
acquisition through evolution, based on the assumption that a gene with two 
defects is typically likely to be more defective than a gene with one defect, and 
an organism with two defective genes is likely to be less fit than an organism 
with one defective gene. Undeniably, this is a crude model, since real biological 
systems are baroque constructions with complex interactions. Nevertheless, 
we persist with a simple model because it readily yields striking results. 

What we find from this simple model is that 

1. John Maynard Smith's figure of 1 bit per generation is correct for an 
asexually-reproducing population; 

2. in contrast, if the species reproduces sexually, the rate of information 
acquisition can be as large as y/G bits per generation, where G is the 
size of the genome. 

We'll also find interesting results concerning the maximum mutation rate 
that a species can withstand. 

► 19.1 The model 

We study a simple model of a reproducing population of N individuals with 
a genome of size G bits: variation is produced by mutation or by recombina- 
tion (i.e., sex) and truncation selection selects the N fittest children at each 
generation to be the parents of the next. We find striking differences between 
populations that have recombination and populations that do not. 

The genotype of each individual is a vector x of G bits, each having a good 
state x g = 1 and a bad state x g = 0. The fitness F(x) of an individual is simply 
the sum of her bits: 

G 

F(x) = £> ff . (19.1) 

9=1 

The bits in the genome could be considered to correspond either to genes 
that have good alleles (x g = l) and bad alleles (x g = 0), or to the nucleotides 
of a genome. We will concentrate on the latter interpretation. The essential 
property of fitness that we are assuming is that it is locally a roughly linear 
function of the genome, that is, that there are many possible changes one 
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could make to the genome, each of which has a small effect on fitness, and 
that these effects combine approximately linearly. 
We define the normalized fitness /(x) = F(x)/G. 

We consider evolution by natural selection under two models of variation. 

Variation by mutation. The model assumes discrete generations. At each 
generation, t, every individual produces two children. The children's 
genotypes differ from the parent's by random mutations. Natural selec- 
tion selects the fittest N progeny in the child population to reproduce, 
and a new generation starts. 

[The selection of the fittest N individuals at each generation is known 
as truncation selection.] 

The simplest model of mutations is that the child's bits {x g } are in- 
dependent. Each bit has a small probability of being flipped, which, 
thinking of the bits as corresponding roughly to nucleotides, is taken to 
be a constant m, independent of x g . [If alternatively we thought of the 
bits as corresponding to genes, then we would model the probability of 
the discovery of a good gene, P(x g = — > x g = l), as being a smaller 
number than the probability of a deleterious mutation in a good gene, 
P(x g = l^x g = 0).} 

Variation by recombination (or crossover, or sex). Our organisms are 
haploid, not diploid. They enjoy sex by recombination. The N individ- 
uals in the population are married into M = N/2 couples, at random, 
and each couple has C children - with C = 4 children being our stan- 
dard assumption, so as to have the population double and halve every 
generation, as before. The C children's genotypes are independent given 
the parents'. Each child obtains its genotype z by random crossover of 
its parents' genotypes, x and y. The simplest model of recombination 
has no linkage, so that: 

z = ( x g with probability 1/2 
9 I Vg with probability 1/2. 

Once the MC progeny have been born, the parents pass away, the fittest 
N progeny are selected by natural selection, and a new generation starts. 

We now study these two models of variation in detail. 

► 19.2 Rate of increase of fitness 

Theory of mutations 

We assume that the genotype of an individual with normalized fitness / = F/G 
is subjected to mutations that flip bits with probability m. We first show that 
if the average normalized fitness / of the population is greater than 1/2, then 
the optimal mutation rate is small, and the rate of acquisition of information 
is at most of order one bit per generation. 

Since it is easy to achieve a normalized fitness of / = 1/2 by simple muta- 
tion, we'll assume / > 1/2 and work in terms of the excess normalized fitness 
5f = / — 1/2. If an individual with excess normalized fitness Sf has a child 
and the mutation rate m is small, the probability distribution of the excess 
normalized fitness of the child has mean 



<5/ child = (1 - 2m)Sf 



(19.3) 
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and variance 

m(l - to) ^ to 

G ~ G' 1 ' 

If the population of parents has mean Sf(t) and variance cr 2 (t) = f3m/G, then 
the child population, before selection, will have mean (1 — 2m)6f(t) and vari- 
ance (l + /3)m/G. Natural selection chooses the upper half of this distribution, 
so the mean fitness and variance of fitness at the next generation are given by 



6f(t+l) = {l-2m)5f{t)+a^TW)^- 1 (19-5) 

4+l)= 7 (l + i3)^, (19.6) 

where a is the mean deviation from the mean, measured in standard devia- 
tions, and 7 is the factor by which the child distribution's variance is reduced 
by selection. The numbers a and 7 are of order 1. For the case of a Gaussian 
distribution, a = \/2/tt ~ 0.8 and 7 = (1 — 2/V) ~ 0.36. If we assume that 
the variance is in dynamic equilibrium, i.e., a (i+1) ~ a 2 {t), then 

7 (1 + /?)=/?, so (l + /3) = — !— , (19.7) 

1-7 

and the factor ot\j (1 + 0) in equation (19.5) is equal to 1, if we take the results 
for the Gaussian distribution, an approximation that becomes poorest when 
the discreteness of fitness becomes important, i.e., for small to. The rate of 
increase of normalized fitness is thus: 

which, assuming G(5f) 3> 1, is maximized for 

m opt = ) ,, , (19.9) 



IQG(Sf) 

at which point, 

( d!J opt ~ 8GW) 
So the rate of increase of fitness F = fG is at most 
dF 



(19.10) 



dt 8(Sf) 



per generation. (19.11) 



For a population with low fitness (Sf < 0.125), the rate of increase of fitness 
may exceed 1 unit per generation. Indeed, if Sf < 1/ y/G, the rate of increase, if 
m = V2, is of order \/G; this initial spurt can last only of order \/G generations. 
For 6f > 0.125, the rate of increase of fitness is smaller than one per generation. 
As the fitness approaches G, the optimal mutation rate tends to m= 1/(4G), so 
that an average of 1/4 bits are flipped per genotype, and the rate of increase of 
fitness is also equal to 1/4; information is gained at a rate of about 0.5 bits per 
generation. It takes about 2G generations for the genotypes of all individuals 
in the population to attain perfection. 
For fixed to, the fitness is given by 

(19.12) 
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No sex 



Histogram of parents' fitness 



Histogram of children's fitness 



Selected children's fitness 




Sex 



subject to the constraint Sf(t) < 1/2, where c is a constant of integration, 
equal to 1 if /(0) = 1/2. If the mean number of bits flipped per genotype, 
mG, exceeds 1, then the fitness F approaches an equilibrium value F eqin = 
(1/2 + l/(2VmG))G. 

This theory is somewhat inaccurate in that the true probability distribu- 
tion of fitness is non-Gaussian, asymmetrical, and quantized to integer values. 
All the same, the predictions of the theory are not grossly at variance with 
the results of simulations described below. 

Theory of sex 

The analysis of the sexual population becomes tractable with two approxi- 
mations: first, we assume that the gene-pool mixes sufficiently rapidly that 
correlations between genes can be neglected; second, we assume homogeneity, 
i.e., that the fraction f g of bits g that are in the good state is the same, f(t), 
for all g. 

Given these assumptions, if two parents of fitness F = fG mate, the prob- 
ability distribution of their children's fitness has mean equal to the parents' 
fitness, F; the variation produced by sex does not reduce the average fitness. 
The standard deviation of the fitness of the children scales as \jGf{\ — /). 
Since, after selection, the increase in fitness is proportional to this standard 
deviation, the fitness increase per generation scales as the square root of the 
size of the genome, \/~G. As shown in box 19.2, the mean fitness F = fG 
evolves in accordance with the differential equation: 



¥L^r,y/f{t){l-f{t))G, 
where r) = \/2/(ir + 2). The solution of this equation is 



(19.13) 



fit) = J 



sin 



VG 



, for t + c e (-§VG/?7,§ VG/v), (19-14) 



where c is a constant of integration, c = sin _1 (2/(0) — 1). So this idealized 
system reaches a state of eugenic perfection (/ = 1) within a finite time: 
(n/r])VG generations. 



Figure 19.1. Why sex is better 
than sex-free reproduction. If 
mutations are used to create 
variation among children, then it 
is unavoidable that the average 
fitness of the children is lower 
than the parents' fitness; the 
greater the variation, the greater 
the average deficit. Selection 
bumps up the mean fitness again. 
In contrast, recombination 
produces variation without a 
decrease in average fitness. The 
typical amount of variation scales 
as vG, where G is the genome 
size, so after selection, the average 
fitness rises by 0{\/G). 



Simulations 

Figure 19.3a shows the fitness of a sexual population of N = 1000 individ- 
uals with a genome size of G = 1000 starting from a random initial state 
with normalized fitness 0.5. It also shows the theoretical curve f(t)G from 
equation (19.14), which fits remarkably well. 

In contrast, figures 19.3(b) and (c) show the evolving fitness when variation 
is produced by mutation at rates m = 0.25/G and m = 6/G respectively. Note 
the difference in the horizontal scales from panel (a). 
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How does /(t+1) depend on /(t)? Let's first assume the two parents of a child both 
have exactly f(t)G good bits, and, by our homogeneity assumption, that those bits are 
independent random subsets of the G bits. The number of bits that are good in both 
parents is roughly f(t) 2 G, and the number that are good in one parent only is roughly 
2 /(*)(l-/(*))G, so the fitness of the child will be f(t) 2 G plus the sum of 2/(t)(l-/(t))G 
fair coin flips, which has a binomial distribution of mean /(t)(l — f(t))G and variance 
|/(t)(l — f(t))G. The fitness of a child is thus roughly distributed as 



-Fdiiid ~ Normal ^mean = 



f(t)G, variance = -/(t)(l - f(t))G 



The important property of this distribution, contrasted with the distribution under 
mutation, is that the mean fitness is equal to the parents' fitness; the variation produced 
by sex does not reduce the average fitness. 

If we include the parental population's variance, which we will write as cr 2 (t) = 
0{i)\f{t){l - /(t))G, the children's fitnesses are distributed as 

-Fchild ~ Normal ^mean = f(t)G, variance = 

Natural selection selects the children on the upper side of this distribution. The mean 
increase in fitness will be 



F(t+1) - F(t) = [a(l + l3/2) 1/2 /V2Wf(t)(l-f(t))G, 
and the variance of the surviving children will be 

( r 2 (i + l) = 7 (l + / 3/2)i/(i)(l-/(t))G, 

where a = and 7 = (1 — 2/ir). If there is dynamic equilibrium [o~ 2 (t + 1) = o~ 2 (tj\ 

then the factor in (19.2) is 



a(l+/3/2) 1/2 /V2 = 



(tt + 2) 



: 0.62. 



Defining this constant to be 77 = y'2/(7r + 2), we conclude that, under sex and natural 
selection, the mean fitness of the population increases at a rate proportional to the 
square root of the size of the genome, 

dF 



— j^- ~ r/^y f(t)(l — f(t))G bits per generation. 



Box 19.2. Details of the theory of 
sex. 
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Figure 19.3. Fitness as a function 
of time. The genome size is 
G = 1000. The dots show the 
fitness of six randomly selected 
individuals from the birth 
population at each generation. 
The initial population of 
N = 1000 had randomly 
generated genomes with 
/(0) = 0.5 (exactly), (a) Variation 
produced by sex alone. Line 
shows theoretical curve (19.14) for 
infinite homogeneous population. 
(b,c) Variation produced by 
mutation, with and without sex, 
when the mutation rate is 
mG = 0.25 (b) or 6 (c) bits per 
genome. The dashed line shows 
the curve (19.12). 
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Exercise 19.1. f 5 ' p - 280 l Dependence on population size. How do the results for 
a sexual population depend on the population size? We anticipate that 
there is a minimum population size above which the theory of sex is 
accurate. How is that minimum population size related to G? 

Exercise 19.2. ^ Dependence on crossover mechanism. In the simple model of 
sex, each bit is taken at random from one of the two parents, that is, we 
allow crossovers to occur with probability 50% between any two adjacent 
nucleotides. How is the model affected (a) if the crossover probability is 
smaller? (b) if crossovers occur exclusively at hot-spots located every d 
bits along the genome? 



► 19.3 The maximal tolerable mutation rate 



Figure 19.4. Maximal tolerable 
mutation rate, shown as number 
of errors per genome (mG), versus 
normalized fitness / = F/G. Left 
panel: genome size G = 1000; 
right: G= 100 000. 
Independent of genome size, a 
parthenogenetic species (no sex) 
can tolerate only of order 1 error 
per genome per generation; a 
species that uses recombination 
(sex) can tolerate far greater 
mutation rates. 



What if we combine the two models of variation? What is the maximum 
mutation rate that can be tolerated by a species that has sex? 
The rate of increase of fitness is given by 



dt 



~ -2m5f + r]V2 



m + /(l-/)/2 



G 



(19.15) 
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which is positive if the mutation rate satisfies 



m < Vy ^ G — • (19.16) 

Let us compare this rate with the result in the absence of sex, which, from 
equation (19.8), is that the maximum tolerable mutation rate is 

m < i^,. (19.17) 

The tolerable mutation rate with sex is of order \[G times greater than that 
without sex! 

A parthenogenetic (non-sexual) species could try to wriggle out of this 
bound on its mutation rate by increasing its litter sizes. But if mutation flips 
on average mG bits, the probability that no bits are flipped in one genome 
is roughly e~ mG , so a mother needs to have roughly e mG offspring in order 
to have a good chance of having one child with the same fitness as her. The 
litter size of a non-sexual species thus has to be exponential in mG (if mG is 
bigger than 1), if the species is to persist. 

So the maximum tolerable mutation rate is pinned close to 1 /G, for a non- 
sexual species, whereas it is a larger number of order 1/^/G, for a species with 
recombination. 

Turning these results around, we can predict the largest possible genome 
size for a given fixed mutation rate, m. For a parthenogenetic species, the 
largest genome size is of order 1/m, and for a sexual species, 1/m 2 . Taking 
the figure m = 10~ 8 as the mutation rate per nucleotide per generation (Eyre- 
Walker and Keightley, 1999), and allowing for a maximum brood size of 20 000 
(that is, mG ~ 10), we predict that all species with more than G = 10 9 coding 
nucleotides make at least occasional use of recombination. If the brood size is 
12, then this number falls to G = 2.5 x 10 8 . 

► 19.4 Fitness increase and information acquisition 

For this simple model it is possible to relate increasing fitness to information 
acquisition. 

If the bits are set at random, the fitness is roughly F = G/2. If evolution 
leads to a population in which all individuals have the maximum fitness F = G, 
then G bits of information have been acquired by the species, namely for each 
bit x g , the species has figured out which of the two states is the better. 

We define the information acquired at an intermediate fitness to be the 
amount of selection (measured in bits) required to select the perfect state 
from the gene pool. Let a fraction f g of the population have x g = l. Because 
log 2 (l//) is the information required to find a black ball in an urn containing 
black and white balls in the ratio / : 1 — /, we define the information acquired 
to be 

/ =E 1 °S2 I %bits. (19.18) 

9 ' 

If all the fractions f g are equal to F/G, then 

2F 

/ = Glog 2 — , (19.19) 

which is well approximated by 

I = 2{F - G/2). (19.20) 

The rate of information acquisition is thus roughly two times the rate of in- 
crease of fitness in the population. 
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19.5: Discussion 

► 19.5 Discussion 

These results quantify the well known argument for why species reproduce 
by sex with recombination, namely that recombination allows useful muta- 
tions to spread more rapidly through the species and allows deleterious muta- 
tions to be more rapidly cleared from the population (Maynard Smith, 1978; 
Felsenstein, 1985; Maynard Smith, 1988; Maynard Smith and Szathmary, 
1995). A population that reproduces by recombination can acquire informa- 
tion from natural selection at a rate of order \fG times faster than a partheno- 
genetic population, and it can tolerate a mutation rate that is of order \[G 
times greater. For genomes of size G ~ 10 8 coding nucleotides, this factor of 
\[G is substantial. 

This enormous advantage conferred by sex has been noted before by Kon- 
drashov (1988), but this meme, which Kondrashov calls 'the deterministic 
mutation hypothesis', does not seem to have diffused throughout the evolu- 
tionary research community, as there are still numerous papers in which the 
prevalence of sex is viewed as a mystery to be explained by elaborate mecha- 
nisms. 

'The cost of males ' - stability of a gene for sex or parthenogenesis 

Why do people declare sex to be a mystery? The main motivation for being 
mystified is an idea called the 'cost of males'. Sexual reproduction is disad- 
vantageous compared with asexual reproduction, it's argued, because of every 
two offspring produced by sex, one (on average) is a useless male, incapable 
of child-bearing, and only one is a productive female. In the same time, a 
parthenogenetic mother could give birth to two female clones. To put it an- 
other way, the big advantage of parthenogenesis, from the point of view of 
the individual, is that one is able to pass on 100% of one's genome to one's 
children, instead of only 50%. Thus if there were two versions of a species, one 
reproducing with and one without sex, the single mothers would be expected 
to outstrip their sexual cousins. The simple model presented thus far did not 
include either genders or the ability to convert from sexual reproduction to 
asexual, but we can easily modify the model. 

We modify the model so that one of the G bits in the genome determines 
whether an individual prefers to reproduce parthenogenetically (x= 1) or sex- 
ually (x = 0). The results depend on the number of children had by a single 
parthenogenetic mother, K p and the number of children born by a sexual 
couple, K s . Both (K p = 2, K S = A) and (K p = A, K S = A) are reasonable mod- 
els. The former (K p = 2, K s = 4) would seem most appropriate in the case 
of unicellular organisms, where the cytoplasm of both parents goes into the 
children. The latter (if p = 4, K S =A) is appropriate if the children are solely 
nurtured by one of the parents, so single mothers have just as many offspring 
as a sexual pair. I concentrate on the latter model, since it gives the greatest 
advantage to the parthenogens, who are supposedly expected to outbreed the 
sexual community. Because parthenogens have four children per generation, 
the maximum tolerable mutation rate for them is twice the expression (19.17) 
derived before for K p = 2. If the fitness is large, the maximum tolerable rate 
is raG ~ 2. 

Initially the genomes are set randomly with F = G/2, with half of the pop- 
ulation having the gene for parthenogenesis. Figure 19.5 shows the outcome. 
During the 'learning' phase of evolution, in which the fitness is increasing 
rapidly, pockets of parthenogens appear briefly, but then disappear within 
a couple of generations as their sexual cousins overtake them in fitness and 
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(b) mG = 1 Figure 19.5. Results when there is 

a gene for parthenogenesis, and no 
interbreeding, and single mothers 
produce as many children as 
sexual couples. G = 1000, 
N = 1000. (a) mG = 4; (b) 
mG = 1. Vertical axes show the 
fitnesses of the two 
sub-populations, and the 
percentage of the population that 
is parthcnogcnctic. 
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leave them behind. Once the population reaches its top fitness, however, the 
parthenogens can take over, if the mutation rate is sufficiently low (mG= 1). 

In the presence of a higher mutation rate (mG = 4), however, the 
parthenogens never take over. The breadth of the sexual population's fit- 
ness is of order \/G, so a mutant parthenogenetic colony arising with slightly 
above-average fitness will last for about VG/(mG) = l/{m\fG) generations 
before its fitness falls below that of its sexual cousins. As long as the popu- 
lation size is sufficiently large for some sexual individuals to survive for this 
time, sex will not die out. 

In a sufficiently unstable environment, where the fitness function is con- 
tinually changing, the parthenogens will always lag behind the sexual commu- 
nity. These results are consistent with the argument of Haldane and Hamilton 
(2002) that sex is helpful in an arms race with parasites. The parasites define 
an effective fitness function which changes with time, and a sexual population 
will always ascend the current fitness function more rapidly. 

Additive fitness function 

Of course, our results depend on the fitness function that we assume, and on 
our model of selection. Is it reasonable to model fitness, to first order, as a sum 
of independent terms? Maynard Smith (1968) argues that it is: the more good 
genes you have, the higher you come in the pecking order, for example. The 
directional selection model has been used extensively in theoretical popula- 
tion genetic studies (Bulmer, 1985). We might expect real fitness functions to 
involve interactions, in which case crossover might reduce the average fitness. 
However, since recombination gives the biggest advantage to species whose fit- 
ness functions are additive, we might predict that evolution will have favoured 
species that used a representation of the genome that corresponds to a fitness 
function that has only weak interactions. And even if there are interactions, 
it seems plausible that the fitness would still involve a sum of such interacting 
terms, with the number of terms being some fraction of the genome size G. 
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19.6: Further exercises 

Exercise 19.3.^^ Investigate how fast sexual and asexual species evolve if 
they have a fitness function with interactions. For example, let the fitness 
be a sum of exclusive-ors of pairs of bits; compare the evolving fitnesses 
with those of the sexual and asexual species with a simple additive fitness 
function. 

Furthermore, if the fitness function were a highly nonlinear function of the 
genotype, it could be made more smooth and locally linear by the Baldwin 
effect. The Baldwin effect (Baldwin, 1896; Hinton and Nowlan, 1987) has been 
widely studied as a mechanism whereby learning guides evolution, and it could 
also act at the level of transcription and translation. Consider the evolution of 
a peptide sequence for a new purpose. Assume the effectiveness of the peptide 
is a highly nonlinear function of the sequence, perhaps having a small island 
of good sequences surrounded by an ocean of equally bad sequences. In an 
organism whose transcription and translation machinery is flawless, the fitness 
will be an equally nonlinear function of the DNA sequence, and evolution 
will wander around the ocean making progress towards the island only by a 
random walk. In contrast, an organism having the same DNA sequence, but 
whose DNA-to-RNA transcription or RNA-to-protein translation is 'faulty', 
will occasionally, by mistranslation or mistranscription, accidentally produce a 
working enzyme; and it will do so with greater probability if its DNA sequence 
is close to a good sequence. One cell might produce 1000 proteins from the 
one mRNA sequence, of which 999 have no enzymatic effect, and one does. 
The one working catalyst will be enough for that cell to have an increased 
fitness relative to rivals whose DNA sequence is further from the island of 
good sequences. For this reason I conjecture that, at least early in evolution, 
and perhaps still now, the genetic code was not implemented perfectly but was 
implemented noisily, with some codons coding for a distribution of possible 
amino acids. This noisy code could even be switched on and off from cell 
to cell in an organism by having multiple aminoacyl-tRNA synthetases, some 
more reliable than others. 

Whilst our model assumed that the bits of the genome do not interact, 
ignored the fact that the information is represented redundantly, assumed 
that there is a direct relationship between phenotypic fitness and the genotype, 
and assumed that the crossover probability in recombination is high, I believe 
these qualitative results would still hold if more complex models of fitness and 
crossover were used: the relative benefit of sex will still scale as \/G. Only in 
small, in-bred populations are the benefits of sex expected to be diminished. 

In summary: Why have sex? Because sex is good for your bits! 

Further reading 

How did a high-information-content self-replicating system ever emerge in the 
first place? In the general area of the origins of life and other tricky ques- 
tions about evolution, I highly recommend Maynard Smith and Szathmary 
(1995), Maynard Smith and Szathmary (1999), Kondrashov (1988), May- 
nard Smith (1988), Ridley (2000), Dyson (1985), Cairns-Smith (1985), and 
Hopfield (1978). 

► 19.6 Further exercises 

Exercise 19.4.^ How good must the error-correcting machinery in DNA repli- 
cation be, given that mammals have not all died out long ago? Estimate the 
probability of nucleotide substitution, per cell division. [See Appendix C.4.] 
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Exercise 19.5. ^ Given that DNA replication is achieved by bumbling Brow- 
nian motion and ordinary thermodynamics in a biochemical porridge at a 
temperature of 35 C, it's astonishing that the error-rate of DNA replication 
is about 1CT 9 per replicated nucleotide. How can this reliability be achieved, 
given that the energetic difference between a correct base-pairing and an incor- 
rect one is only one or two hydrogen bonds and the thermal energy kT is only 
about a factor of four smaller than the free energy associated with a hydro- 
gen bond? If ordinary thermodynamics is what favours correct base-pairing, 
surely the frequency of incorrect base-pairing should be about 

/ = exp{- AE /kT), (19.21) 

where AE is the free energy difference, i.e., an error frequency of / ~ 1CP 4 ? 
How has DNA replication cheated thermodynamics? 

The situation is equally perplexing in the case of protein synthesis, which 
translates an mRNA sequence into a polypeptide in accordance with the ge- 
netic code. Two specific chemical reactions are protected against errors: the 
binding of tRNA molecules to amino acids, and the production of the polypep- 
tide in the ribosome, which, like DNA replication, involves base-pairing. 
Again, the fidelity is high (an error rate of about 10~ 4 ), and this fidelity 
can't be caused by the energy of the 'correct' final state being especially low 
- the correct polypeptide sequence is not expected to be significantly lower in 
energy than any other sequence. How do cells perform error correction? (See 
Hopfield (1974), Hopfield (1980)). 

Exercise 19.6. ^ While the genome acquires information through natural se- 
lection at a rate of a few bits per generation, your brain acquires information 
at a greater rate. 

Estimate at what rate new information can be stored in long term memory 
by your brain. Think of learning the words of a new language, for example. 

► 19.7 Solutions 

Solution to exercise 19.1 (p. 275). For small enough N, whilst the average fit- 
ness of the population increases, some unlucky bits become frozen into the 
bad state. (These bad genes are sometimes known as hitchhikers.) The ho- 
mogeneity assumption breaks down. Eventually, all individuals have identical 
genotypes that are mainly 1-bits, but contain some 0-bits too. The smaller 
the population, the greater the number of frozen 0-bits is expected to be. How 
small can the population size N be if the theory of sex is accurate? 

We find experimentally that the theory based on assuming homogeneity fits 
poorly only if the population size N is smaller than ~ \/~G. If N is significantly 
smaller than VG, information cannot possibly be acquired at a rate as big as 
Vg, since the information content of the Blind Watchmaker's decisions cannot 
be any greater than 2N bits per generation, this being the number of bits 
required to specify which of the 2N children get to reproduce. Baum et al. 
(1995), analyzing a similar model, show that the population size N should be 
about \/G(logG) 2 to make hitchhikers unlikely to arise. 
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Part IV 

Probabilities and Inference 
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About Part IV 



The number of inference problems that can (and perhaps should) be tackled 
by Bayesian inference methods is enormous. In this book, for example, we 
discuss the decoding problem for error-correcting codes, the task of inferring 
clusters from data, the task of interpolation through noisy data, and the task 
of classifying patterns given labelled examples. Most techniques for solving 
these problems can be categorized as follows. 

Exact methods compute the required quantities directly. Only a few inter- 
esting problems have a direct solution, but exact methods are important 
as tools for solving subtasks within larger problems. Methods for the 
exact solution of inference problems are the subject of Chapters 21, 24, 
25, and 26. 

Approximate methods can be subdivided into 

1. deterministic approximations, which include maximum likeli- 
hood (Chapter 22), Laplace's method (Chapters 27 and 28) and 
variational methods (Chapter 33); and 

2. Monte Carlo methods - techniques in which random numbers 
play an integral part - which will be discussed in Chapters 29, 30, 
and 32. 

This part of the book does not form a one-dimensional story. Rather, the 
ideas make up a web of interrelated threads which will recombine in subsequent 
chapters. 

Chapter 3, which is an honorary member of this part, discussed a range of 
simple examples of inference problems and their Bayesian solutions. 

To give further motivation for the toolbox of inference methods discussed in 
this part, Chapter 20 discusses the problem of clustering; subsequent chapters 
discuss the probabilistic interpretation of clustering as mixture modelling. 

Chapter 21 discusses the option of dealing with probability distributions 
by completely enumerating all hypotheses. Chapter 22 introduces the idea 
of maximization methods as a way of avoiding the large cost associated with 
complete enumeration, and points out reasons why maximum likelihood is 
not good enough. Chapter 23 reviews the probability distributions that arise 
most often in Bayesian inference. Chapters 24, 25, and 26 discuss another 
way of avoiding the cost of complete enumeration: marginalization. Chapter 
25 discusses message-passing methods appropriate for graphical models, using 
the decoding of error-correcting codes as an example. Chapter 26 combines 
these ideas with message-passing concepts from Chapters 16 and 17. These 
chapters are a prerequisite for the understanding of advanced error-correcting 
codes. 

Chapter 27 discusses deterministic approximations including Laplace's 
method. This chapter is a prerequisite for understanding the topic of complex- 
ity control in learning algorithms, an idea that is discussed in general terms 
in Chapter 28. 
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About Part IV 

Chapter 29 discusses Monte Carlo methods. Chapter 30 gives details of 
state-of-the-art Monte Carlo techniques. 

Chapter 31 introduces the Ising model as a test-bed for probabilistic meth- 
ods. An exact message-passing method and a Monte Carlo method are demon- 
strated. A motivation for studying the Ising model is that it is intimately 
related to several neural network models. Chapter 32 describes 'exact' Monte 
Carlo methods and demonstrates their application to the Ising model. 

Chapter 33 discusses variational methods and their application to Ising 
models and to simple statistical inference problems including clustering. This 
chapter will help the reader understand the Hopfield network (Chapter 42) 
and the EM algorithm, which is an important method in latent-variable mod- 
elling. Chapter 34 discusses a particularly simple latent variable model called 
independent component analysis. 

Chapter 35 discusses a ragbag of assorted inference topics. Chapter 36 
discusses a simple example of decision theory. Chapter 37 discusses differences 
between sampling theory and Bayesian methods. 

A theme: what inference is about 

A widespread misconception is that the aim of inference is to find the most 
probable explanation for some data. While this most probable hypothesis may 
be of interest, and some inference methods do locate it, this hypothesis is just 
the peak of a probability distribution, and it is the whole distribution that is 
of interest. As we saw in Chapter 4, the most probable outcome from a source 
is often not a typical outcome from that source. Similarly, the most probable 
hypothesis given some data may be atypical of the whole set of reasonably- 
plausible hypotheses. 

About Chapter 20 

Before reading the next chapter, exercise 2.17 (p. 36) and section 11.2 (inferring 
the input to a Gaussian channel) are recommended reading. 
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An Example Inference Task: Clustering 



Human brains are good at finding regularities in data. One way of expressing 
regularity is to put a set of objects into groups that are similar to each other. 
For example, biologists have found that most objects in the natural world 
fall into one of two categories: things that are brown and run away, and 
things that are green and don't run away. The first group they call animals, 
and the second, plants. We'll call this operation of grouping things together 
clustering. If the biologist further sub-divides the cluster of plants into sub- 
clusters, we would call this 'hierarchical clustering'; but we won't be talking 
about hierarchical clustering yet. In this chapter we'll just discuss ways to 
take a set of N objects and group them into K clusters. 

There are several motivations for clustering. First, a good clustering has 
predictive power. When an early biologist encounters a new green thing he has 
not seen before, his internal model of plants and animals fills in predictions for 
attributes of the green thing: it's unlikely to jump on him and eat him; if he 
touches it, he might get grazed or stung; if he eats it, he might feel sick. All of 
these predictions, while uncertain, are useful, because they help the biologist 
invest his resources (for example, the time spent watching for predators) well. 
Thus, we perform clustering because we believe the underlying cluster labels 
are meaningful, will lead to a more efficient description of our data, and will 
help us choose better actions. This type of clustering is sometimes called 
'mixture density modelling', and the objective function that measures how 
well the predictive model is working is the information content of the data, 
logl/P({x}). 

Second, clusters can be a useful aid to communication because they allow 
lossy compression. The biologist can give directions to a friend such as 'go to 
the third tree on the right then take a right turn' (rather than 'go past the 
large green thing with red berries, then past the large green thing with thorns, 
then . . .'). The brief category name 'tree' is helpful because it is sufficient to 
identify an object. Similarly, in lossy image compression, the aim is to convey 
in as few bits as possible a reasonable reproduction of a picture; one way to do 
this is to divide the image into N small patches, and find a close match to each 
patch in an alphabet of K image-templates; then we send a close fit to the 
image by sending the list of labels fci, fe, • • • , fcjV of the matching templates. 
The task of creating a good library of image-templates is equivalent to finding 
a set of cluster centres. This type of clustering is sometimes called 'vector 
quantization'. 

We can formalize a vector quantizer in terms of an assignment rule x — > 
fc(x) for assigning datapoints x to one of K codenames, and a reconstruction 
rule k — > mS k \ the aim being to choose the functions fc(x) and so as to 
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20.1: K-means clustering 
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minimize the expected distortion, which might be defined to be 

1 2 



D = 5>(*)i[m«*»-x] : 



(20.1) 



[The ideal objective function would be to minimize the psychologically per- 
ceived distortion of the image. Since it is hard to quantify the distortion 
perceived by a human, vector quantization and lossy compression are not so 
crisply defined problems as data modelling and lossless compression.] In vec- 
tor quantization, we don't necessarily believe that the templates {m' fc )} have 
any natural meaning; they are simply tools to do a job. We note in passing 
the similarity of the assignment rule (i.e., the encoder) of vector quantization 
to the decoding problem when decoding an error-correcting code. 

A third reason for making a cluster model is that failures of the cluster 
model may highlight interesting objects that deserve special attention. If 
we have trained a vector quantizer to do a good job of compressing satellite 
pictures of ocean surfaces, then maybe patches of image that are not well 
compressed by the vector quantizer are the patches that contain ships! If the 
biologist encounters a green thing and sees it run (or slither) away, this misfit 
with his cluster model (which says green things don't run away) cues him 
to pay special attention. One can't spend all one's time being fascinated by 
things; the cluster model can help sift out from the multitude of objects in 
one's world the ones that really deserve attention. 

A fourth reason for liking clustering algorithms is that they may serve 
as models of learning processes in neural systems. The clustering algorithm 
that we now discuss, the K-means algorithm, is an example of a competitive 
learning algorithm. The algorithm works by having the K clusters compete 
with each other for the right to own the data points. 



Figure 20.1. iV = 40 data points. 



► 20.1 K-means clustering 



The K-means algorithm is an algorithm for putting N data points in an I- 
dimensional space into K clusters. Each cluster is parameterized by a vector 
m( fe ) called its mean. 

The data points will be denoted by x("' where the superscript n runs from 
1 to the number of data points N. Each vector x is a vector with / components 
Xj. We will assume that the space that x lives in is a real space and that we 
have a metric that defines distances between points, for example, 



rf ( x >y) = oZl^ 



*) 5 



(20.2) 



About the name... As far as I 

know, the 'K' in K-means 
clustering simply refers to the 
chosen number of clusters. If 
Newton had followed the same 
naming policy, maybe wc would 
learn at school about 'calculus for 
the variable x'. It's a silly name, 
but we are stuck with it. 



To start the K-means algorithm (algorithm 20.2), the K means {m^ fc '} 
are initialized in some way, for example to random values. K-means is then 
an iterative two-step algorithm. In the assignment step, each data point n is 
assigned to the nearest mean. In the update step, the means are adjusted to 
match the sample means of the data points that they are responsible for. 

The K-means algorithm is demonstrated for a toy two-dimensional data set 
in figure 20.3, where 2 means are used. The assignments of the points to the 
two clusters are indicated by two point styles, and the two means are shown 
by the circles. The algorithm converges after three iterations, at which point 
the assignments are unchanged so the means remain unmoved when updated. 
The K-means algorithm always converges to a fixed point. 
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Initialization. Set K means {m^'} to random values. 

Assignment step. Each data point n is assigned to the nearest mean. 
We denote our guess for the cluster k^ that the point x("' belongs 
to by jfcW. 

fc(") = argmin{d(m (fc) ,x (ll) )}. (20.3) 

k 

An alternative, equivalent representation of this assignment of 
points to clusters is given by 'responsibilities', which are indicator 
variables In the assignment step, we set r[™' to one if mean k 

is the closest mean to datapoint x("); otherwise is zero. 



» 



r i if fc(" 

\ if U n 



= k 
■) + k. 



(20.4) 



What about ties? - We don't expect two means to be exactly the 
same distance from a data point, but if a tie does happen, is 
set to the smallest of the winning {k}. 

Update step. The model parameters, the means, are adjusted to match 
the sample means of the data points that they are responsible for. 



£ r W x (») 



m 



(fc) 



R( k ) 



where is the total responsibility of mean k, 



» 



(20.5) 



(20.6) 



Algorithm 20.2. The K-means 
clustering algorithm. 



What about means with no responsibilities? - If R^ k > = 0, then we 
leave the mean where it is. 

Repeat the assignment step and update step until the assign- 
ments do not change. 
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Figure 20.3. K-means algorithm 
applied to a data set of 40 points. 
K = 2 means evolve to stable 
locations after three iterations. 
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Figure 20.4. K-mcans algorithm 
applied to a data set of 40 points. 
Two separate runs, both with 
K = 4 means, reach different 
solutions. Each frame shows a 
successive assignment step. 




Exercise 20. lM' p ' ' See if you can prove that K-means always converges. 
[Hint: find a physical analogy and an associated Lyapunov function.] 

[A Lyapunov function is a function of the state of the algorithm that 
decreases whenever the state changes and that is bounded below. If a 
system has a Lyapunov function then its dynamics converge.] 

The K-means algorithm with a larger number of means, 4, is demonstrated in 
figure 20.4. The outcome of the algorithm depends on the initial condition. 
In the first case, after five iterations, a steady state is found in which the data 
points are fairly evenly split between the four clusters. In the second case, 
after six iterations, half the data points are in one cluster, and the others are 
shared among the other three clusters. 

Questions about this algorithm 

The K-means algorithm has several ad hoc features. Why does the update step 
set the 'mean' to the mean of the assigned points? Where did the distance d 
come from? What if we used a different measure of distance between x and m? 
How can we choose the 'best' distance? [In vector quantization, the distance 
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Figure 20.5. K-means algorithm 
for a case with two dissimilar 
clusters, (a) The "little 'n' large" 
data, (b) A stable set of 
assignments and means. Note that 
four points belonging to the broad 
cluster have been incorrectly 
assigned to the narrower cluster. 
(Points assigned to the right-hand 
cluster are shown by plus signs.) 
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Figure 20.6. Two elongated 
clusters, and the stable solution 
found by the K-means algorithm. 



function is provided as part of the problem definition; but I'm assuming we 
are interested in data-modelling rather than vector quantization.] How do we 
choose Kl Having found multiple alternative clusterings for a given K, how 
can we choose among them? 

Cases where K-means might be viewed as failing. 

Further questions arise when we look for cases where the algorithm behaves 
badly (compared with what the man in the street would call 'clustering'). 
Figure 20.5a shows a set of 75 data points generated from a mixture of two 
Gaussians. The right-hand Gaussian has less weight (only one fifth of the data 
points), and it is a less broad cluster. Figure 20.5b shows the outcome of using 
K-means clustering with K = 2 means. Four of the big cluster's data points 
have been assigned to the small cluster, and both means end up displaced 
to the left of the true centres of the clusters. The K-means algorithm takes 
account only of the distance between the means and the data points; it has 
no representation of the weight or breadth of each cluster. Consequently, data 
points that actually belong to the broad cluster are incorrectly assigned to the 
narrow cluster. 

Figure 20.6 shows another case of K-means behaving badly. The data 
evidently fall into two elongated clusters. But the only stable state of the 
K-means algorithm is that shown in figure 20.6b: the two clusters have been 
sliced in half! These two examples show that there is something wrong with 
the distance d in the K-means algorithm. The K-means algorithm has no way 
of representing the size or shape of a cluster. 

A final criticism of K-means is that it is a 'hard' rather than a 'soft' 
algorithm: points are assigned to exactly one cluster and all points assigned 
to a cluster are equals in that cluster. Points located near the border between 
two or more clusters should, arguably, play a partial role in determining the 
locations of all the clusters that they could plausibly be assigned to. But in 
the K-means algorithm, each borderline point is dumped in one cluster, and 
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20.2: Soft K-means clustering 289 

has an equal vote with all the other points in that cluster, and no vote in any 
other clusters. 



► 20.2 Soft K-means clustering 



These criticisms of K-means motivate the 'soft K-means algorithm', algo- 
rithm 20.7. The algorithm has one parameter, (3, which we could term the 
stiffness. 



Assignment step. Each data point x'"' is given a soft 'degree of as- 
signment' to each of the means. We call the degree to which x("' 
is assigned to cluster k the responsibility rjj. (the responsibility of 
cluster k for point n). 



(n)= exp(-/?d(mW,x("))) 
Tk E*'exp(-Mm(fc'),x(")))' 
The sum of the K responsibilities for the nth point is 1. 



(20.7) 



Update step. The model parameters, the means, are adjusted to match 
the sample means of the data points that they are responsible for. 



m 



(*) 



E 



» v (n) 



where is the total responsibility of mean k, 



» 



(20.8) 



(20.9) 



Algorithm 20.7. Soft K-means 
algorithm, version 1. 



Notice the similarity of this soft K-means algorithm to the hard K-means 

algorithm 20.2. The update step is identical; the only difference is that the 

(n) 

responsibilities r k can take on values between and 1. Whereas the assign- 
ment in the K-means algorithm involved a 'min' over the distances, the 
rule for assigning the responsibilities is a 'soft-min' (20.7). 

> Exercise 20.2. ^ Show that as the stiffness (3 goes to oo, the soft K-means algo- 
rithm becomes identical to the original hard K-means algorithm, except 
for the way in which means with no assigned points behave. Describe 
what those means do instead of sitting still. 

Dimensionally, the stiffness /3 is an inverse-length-squared, so we can as- 
sociate a lengthscale, a = 1 /%//?, with it. The soft K-means algorithm is 
demonstrated in figure 20.8. The lengthscale is shown by the radius of the 
circles surrounding the four means. Each panel shows the final fixed point 
reached for a different value of the lengthscale a. 



► 20.3 Conclusion 



At this point, we may have fixed some of the problems with the original K- 
means algorithm by introducing an extra complexity-control parameter (j. But 
how should we set (31 And what about the problem of the elongated clusters, 
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Figure 20.8. Soft K-means 
algorithm, version 1, applied to a 
data set of 40 points. K = 4. 
Implicit lengtliscale parameter 
a = 1//9 1 / 2 varied from a large to 
a small value. Each picture shows 
the state of all four means, with 
the implicit lengthscale shown by 
the radius of the four circles, after 
running the algorithm for several 
tens of iterations. At the largest 
lengtliscale, all four means 
converge exactly to the data 
mean. Then the four means 
separate into two groups of two. 
At shorter lengthscales, each of 
these pairs itself bifurcates into 
subgroups. 




small a 




and the clusters of unequal weight and width? Adding one stiffness parameter 
[1 is not going to make all these problems go away. 

We'll come back to these questions in a later chapter, as we develop the 
mixture-density-modelling view of clustering. 

Further reading 

For a vector-quantization approach to clustering see (Luttrell, 1989; Luttrell, 
1990). 

► 20.4 Exercises 

t> Exercise 20.3.^' p ' 291 l Explore the properties of the soft K-means algorithm, 
version 1, assuming that the datapoints {x} come from a single separable 
two-dimensional Gaussian distribution with mean zero and variances 
(var(a;i), var(x2)) = (<t 2 ,<t|), w ith a \ > °2- Set K = 2, assume N is 
large, and investigate the fixed points of the algorithm as /3 is varied. 
[Hint: assume that mS 1 ' = (m,0) and m^ 2 ' = (— m, 0).] 



i> bxercise 20.4. [3] Consider the soft K -means algorithm applied to a large 
amount of one-dimensional data that comes from a mixture of two equal- 
weight Gaussians with true means [i = ±1 and standard deviation a P , 
for example op = 1. Show that the hard K-means algorithm with K = 2 
leads to a solution in which the two means are further apart than the 
two true means. Discuss what happens for other values of (3, and find 
the value of /3 such that the soft algorithm puts the two means in the 
correct places. 
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► 20.5 Solutions 

Solution to exercise 20.1 (p. 287). We can associate an 'energy' with the state 
of the K- means algorithm by connecting a spring between each point and 
the mean that is responsible for it. The energy of one spring is proportional to 
its squared-length, namely j3d{x.^ , m( fe )) where (3 is the stiffness of the spring. 
The total energy of all the springs is a Lyapunov function for the algorithm, 
because (a) the assignment step can only decrease the energy - a point only 
changes its allegiance if the length of its spring would be reduced; (b) the 
update step can only decrease the energy - moving to the mean is the 

way to minimize the energy of its springs; and (c) the energy is bounded below 
- which is the second condition for a Lyapunov function. Since the algorithm 
has a Lyapunov function, it converges. 

Solution to exercise 20.3 (p. 290). If the means are initialized to mW = (to, 0) 
and mW = (— to,0), the assignment step for a point at location x\,X2 gives 



n(x) 



exp(-/3(x 1 - to) 2 /2) 



exp(-/3(a;i - to) 2 /2) + exp(-(3( Xl + m) 2 /2) 
1 



1 + exp(— 2f3mxi) ' 



and the updated to is 



J dxi P(xi) x\ ri(x) 
/ dxi P(xi)ri(x) 

2 / d.Ti P{xi) x x — t~" 

J 1 + exp(- 



2f3mxi) 



(20.10) 
(20.11) 

(20.12) 
(20.13) 



Now, to = is a fixed point, but the question is, is it stable or unstable? For 
tiny to (that is, (3o\m <C 1), we can Taylor expand 



1 



1 



1 + exp(-2/3TOCEi) 2 



(1 + (3mx\) 



so 



m 



J dxi P(x\) x\ (1 + (3mx\) 
o\(3m. 



(20.14) 

(20.15) 
(20.16) 



For small to, to either grows or decays exponentially under this mapping, 
depending on whether a\(3 is greater than or less than 1. The fixed point 
m = is stable if 

a\ < 1/(3 (20.17) 

and unstable otherwise. [Incidentally, this derivation shows that this result is 
general, holding for any true probability distribution P{xi) having variance 
a\ , not just the Gaussian.] 

If a\ > 1/(3 then there is a bifurcation and there are two stable fixed points 
surrounding the unstable fixed point at to = 0. To illustrate this bifurcation, 
figure 20.10 shows the outcome of running the soft K- means algorithm with 
(3 = 1 on one-dimensional data with standard deviation a\ for various values of 
o\. Figure 20.11 shows this pitchfork bifurcation from the other point of view, 
where the data's standard deviation <j\ is fixed and the algorithm's lengthscale 
a = 1/ (3 1 / 2 is varied on the horizontal axis. 




Figure 20.9. Schematic diagram of 
the bifurcation as the largest data 
variance aj increases from below 
1//3 1 / 2 to above 1//3 1 / 2 . The data 
variance is indicated by the 
ellipse. 




Figure 20.10. The stable mean 
locations as a function of a\ , for 
constant (3, found numerically 
(thick lines), and the 
approximation (20.22) (thin lines). 
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Figure 20.11. The stable mean 
locations as a function of 1//3 1 / 2 , 
for constant o\. 
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Here is a cheap theory to model how the fitted parameters ±m behave beyond 
the bifurcation, based on continuing the series expansion. This continuation 
of the series is rather suspect, since the series isn't necessarily expected to 
converge beyond the bifurcation point, but the theory fits well anyway. 

We take our analytic approach one term further in the expansion 

1 (1 + Prnxi - \(Pmx{f) + ■■■ (20.18) 



1 + exp(-2/3ma;i) 2 V 3 

then we can solve for the shape of the bifurcation to leading order, which 
depends on the fourth moment of the distribution: 

m' ~ J dzi P(zi)zi(l +/3mn - ^(/3mxi) 3 ) (20.19) 

= ffl/3m- ^{/3m) 3 3af. (20.20) 

[At (20.20) we use the fact that P{x\) is Gaussian to find the fourth moment.] 
This map has a fixed point at m such that 

al(3{\ - (Prnfaj) = 1, (20.21) 

i.e., 

m = ±r i/2 (^- 1 ) V2 (20 22) 

The thin line in figure 20.10 shows this theoretical approximation. Figure 20.10 
shows the bifurcation as a function of g\ for fixed /?; figure 20.11 shows the 
bifurcation as a function of 1//3 1 / 2 for fixed o\. 

x> Exercise 20.5. ^' p,292 l Why does the pitchfork in figure 20.11 tend to the val- 
ues ~±0.8 as 1//3 1 / 2 — > 0? Give an analytic expression for this asymp- 
tote. 

Solution to exercise 20.5 (p. 292). The asymptote is the mean of the rectified 
Gaussian, 

Jo°° Normal(a;, l)x dx 



f ., - /2/7r ~ 0.798. (20.23) 
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21 



Exact Inference by Complete 
Enumeration 

We open our toolbox of methods for handling probabilities by discussing a 
brute-force inference method: complete enumeration of all hypotheses, and 
evaluation of their probabilities. This approach is an exact method, and the 
difficulty of carrying it out will motivate the smarter exact and approximate 
methods introduced in the following chapters. 

► 21.1 The burglar alarm 

Bayesian probability theory is sometimes called 'common sense, amplified'. 
When thinking about the following questions, please ask your common sense 
what it thinks the answers are; we will then see how Bayesian methods confirm 
your everyday intuition. 

Example 21.1. Fred lives in Los Angeles and commutes 60 miles to work. 
Whilst at work, he receives a phone-call from his neighbour saying that 
Fred's burglar alarm is ringing. What is the probability that there was 
a burglar in his house today? While driving home to investigate, Fred 
hears on the radio that there was a small earthquake that day near his 
home. 'Oh', he says, feeling relieved, 'it was probably the earthquake 
that set off the alarm'. What is the probability that there was a burglar 
in his house? (After Pearl, 1988). 

Let's introduce variables b (a burglar was present in Fred's house today), 
a (the alarm is ringing), p (Fred receives a phonecall from the neighbour re- 
porting the alarm), e (a small earthquake took place today near Fred's house), 
and r (the radio report of earthquake is heard by Fred). The probability of 
all these variables might factorize as follows: 

P(b, e, a,p, r) = P(b)P(e)P(a | b, e)P(p | a)P(r | e), (21.1) 

and plausible values for the probabilities are: 

1. Burglar probability: 

P(b=l) = p, P(b = 0) = l-f3, (21.2) 
e.g., P = 0.001 gives a mean burglary rate of once every three years. 

2. Earthquake probability: 

P(e=l) = £ , P(e = 0) = l-e, (21.3) 



Earthquake Burglar 

o o 

/ \ / 

O O Alarm 

Radio ^\ 

o 

Phonecall 

Figure 21.1. Belief network for the 
burglar alarm problem. 
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21 — Exact Inference by Complete Enumeration 



with, e.g., e = 0.001; our assertion that the earthquakes are independent 
of burglars, i.e., the prior probability of 6 and e is P(b,e) = P(6)P(e), 
seems reasonable unless we take into account opportunistic burglars who 
strike immediately after earthquakes. 

3. Alarm ringing probability: we assume the alarm will ring if any of the 
following three events happens: (a) a burglar enters the house, and trig- 
gers the alarm (let's assume the alarm has a reliability of a/, = 0.99, i.e., 
99% of burglars trigger the alarm); (b) an earthquake takes place, and 
triggers the alarm (perhaps a e = 1% of alarms are triggered by earth- 
quakes?); or (c) some other event causes a false alarm; let's assume the 
false alarm rate / is 0.001, so Fred has false alarms from non-earthquake 
causes once every three years. [This type of dependence of a on b and e 
is known as a 'noisy-or'.] The probabilities of a given b and e are then: 



P(a = 0|6=0, e = 0) 
P(a = 0|6=l, e = 0) 
P( a = 0|6=0, e = l) 
P(o = 0|6=l, e = l) 

or, in numbers, 

P(a = 0|6=0, e = 0) 
P( a = 0|6=l, e = 0) 
P(a = 0|6=0, e = l) 
P(a = 0|6=l, e = l) 



(1-/), P( a =l|6=0,e = 0) 

(l-/)(l-a 6 ), P( a =l|6=l, e = 0) 

(l-/)(l-a e ), P(a = l\b = 0, e = l) 

(1 - /)(1 - a b )(l - a e ), P(a = l | b= 1, e= 1) 



0.999, 
0.009 99, 
0.989 01, 
0.009 8901, 



P(a = l\b = 0, e = 0) 
P(a=l |6=1, e = 0) 
P(a= 1 1 6 = 0, e=l) 
P(o = l|6 = l, e = l) 



0.001 
0.990 01 
0.010 99 
0.990109 9. 



We assume the neighbour would never phone if the alarm is not ringing 
[P(p=l\a = 0) = 0]; and that the radio is a trustworthy reporter too 
[P(r= 1 1 e= 0) = 0]; we won't need to specify the probabilities P(p= 1 1 a= 1) 
or P(r = 1 | e = 1) in order to answer the questions above, since the outcomes 
p = 1 and r = 1 give us certainty respectively that a= 1 and e= 1. 

We can answer the two questions about the burglar by computing the 
posterior probabilities of all hypotheses given the available information. Let's 
start by reminding ourselves that the probability that there is a burglar, before 
either p or r is observed, is P(6 = 1) = /? = 0.001, and the probability that an 
earthquake took place is P(e= 1) = e = 0.001, and these two propositions are 
independent. 

First, when p= 1, we know that the alarm is ringing: a=l. The posterior 
probability of b and e becomes: 

P(a=l|6,e)P(6)P(e) 



P(b, 



1) 



P(a=l) 



(21.4) 



l-(l-/)(l-a 6 ) 
l-(l-/)(l-a e ) 
l-(l-/)(l-a 6 )(l-a e ) 



The numerator's four possible values are 



P(a = l | 6=0, e = 0) x P(6 = 0) x P(e = 0) 
P(o= 1 1 6=1, e = 0) x P(6=l) x P(e = 0) 
P(o= 1 | 6=0, e=l) x P(6 = 0) x P(e=l) 
P(a= 1 | 6=1, e=l) x P(6=l) x P(e=l) 



0.001 x0.999x0.999 
0.99001 x0.001x0.999 
0.01099 x0.999x0.001 
0.9901099x0.001x0.001 



0.000 998 
0.000 989 
0.000 010 979 
9.9 x 10~ 7 . 



The normalizing constant is the sum of these four numbers, P(a = 1) = 0.002, 
and the posterior probabilities are 



P(6 = 
P(6 = 
P(b- 
P(6 = 



= 0, e = 0|o = l) 
= 1, e = 0|o = l) 
= 0, e=l \a = l) 
= 1, e =l\a=l) 



0.4993 
0.4947 
0.0055 
0.0005. 



(21.5) 
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21.2: Exact inference for continuous hypothesis spaces 

To answer the question, 'what's the probability a burglar was there?' we 
marginalize over the earthquake variable e: 

P(b = 0\a = l) = P{b = 0, e = 0|o = l) + P(b = 0, e=l|a=l) = 0.505 
P(6 = l|o = l) = P(b=l, e = 0|o = l) + P(6=l, e=l|o=l) = 0.495. 

(21-6) 

So there is nearly a 50% chance that there was a burglar present. It is impor- 
tant to note that the variables b and e, which were independent a priori, are 
now dependent. The posterior distribution (21.5) is not a separable function of 
b and e. This fact is illustrated most simply by studying the effect of learning 
that e = 1. 

When we learn e=l, the posterior probability of b is given by 
P(b\e= 1, a = l) = P(b,e = l\a = l)/P(e= 1 1 a = 1), i.e., by dividing the bot- 
tom two rows of (21.5), by their sum P(e = 1 | a = 1) = 0.0060. The posterior 
probability of b is: 

P(6 = 0|e = l,o=l) = 0.92 

P(6=l|e = l, a=l) = 0.08. K ' 

There is thus now an 8% chance that a burglar was in Fred's house. It is 
in accordance with everyday intuition that the probability that b= 1 (a pos- 
sible cause of the alarm) reduces when Fred learns that an earthquake, an 
alternative explanation of the alarm, has happened. 

Explaining away 

This phenomenon, that one of the possible causes (b=l) of some data (the 
data in this case being a = 1) becomes less probable when another of the causes 
(e= 1) becomes more probable, even though those two causes were indepen- 
dent variables a priori, is known as explaining away. Explaining away is an 
important feature of correct inferences, and one that any artificial intelligence 
should replicate. 

If we believe that the neighbour and the radio service are unreliable or 
capricious, so that we are not certain that the alarm really is ringing or that 
an earthquake really has happened, the calculations become more complex, 
but the explaining-away effect persists; the arrival of the earthquake report r 
simultaneously makes it more probable that the alarm truly is ringing, and 
less probable that the burglar was present. 

In summary, we solved the inference questions about the burglar by enu- 
merating all four hypotheses about the variables (6, e), finding their posterior 
probabilities, and marginalizing to obtain the required inferences about b. 

t> bxercise 21.2. [2] After Fred receives the phone-call about the burglar alarm, 
but before he hears the radio report, what, from his point of view, is the 
probability that there was a small earthquake today? 

► 21.2 Exact inference for continuous hypothesis spaces 

Many of the hypothesis spaces we will consider are naturally thought of as 
continuous. For example, the unknown decay length A of section 3.1 (p. 48) 
lives in a continuous one-dimensional space; and the unknown mean and stan- 
dard deviation of a Gaussian /!, a live in a continuous two-dimensional space. 
In any practical computer implementation, such continuous spaces will neces- 
sarily be discretized, however, and so can, in principle, be enumerated - at a 
grid of parameter values, for example. In figure 3.2 we plotted the likelihood 
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Figure 21.2. Enumeration of an 
entire (discretized) hypothesis 
space for one Gaussian with 
parameters ji (horizontal axis) 
and <7 (vertical). 




function for the decay length as a function of A by evaluating the likelihood 
at a finely-spaced series of points. 



A two-parameter model 



Let's look at the Gaussian distribution as an example of a model with a two- 
dimensional hypothesis space. The one-dimensional Gaussian distribution is 
parameterized by a mean \x and a standard deviation a: 



P(x\n,a) = 



exp 



2a 2 



= Normal(a;; n,cr 2 )- 



(21-8) 



Figure 21.2 shows an enumeration of one hundred hypotheses about the mean 
and standard deviation of a one-dimensional Gaussian distribution. These 
hypotheses are evenly spaced in a ten by ten square grid covering ten values 
of fi and ten values of a. Each hypothesis is represented by a picture showing 
the probability density that it puts on x. We now examine the inference of /n 
and a given data points x n , n = 1, . . . , N, assumed to be drawn independently 
from this density. 

Imagine that we acquire data, for example the five points shown in fig- 
ure 21.3. We can now evaluate the posterior probability of each of the one 
hundred subhypotheses by evaluating the likelihood of each, that is, the value 
of P({x n }^ =1 | fi, a). The likelihood values are shown diagrammatically in 
figure 21.4 using the line thickness to encode the value of the likelihood. Sub- 
hypotheses with likelihood smaller than e~ 8 times the maximum likelihood 
have been deleted. 

Using a finer grid, we can represent the same information by plotting the 
likelihood as a surface plot or contour plot as a function of [i and a (figure 21.5). 



Figure 21.3. Five datapoints 
{ x n}n=i- The horizontal 
coordinate is the value of the 
datum, x n ; the vertical coordinate 
has no meaning. 



A five-parameter mixture model 

Eyeballing the data (figure 21.3), you might agree that it seems more plau- 
sible that they come not from a single Gaussian but from a mixture of two 
Gaussians, defined by two means, two standard deviations, and two mixing 
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Figure 21.4. Likelihood function, 
given the data of figure 21.3, 
represented by line thickness. 
Subhypothcscs having likelihood 
smaller than e~ 8 times the 
maximum likelihood arc not 
shown. 



1 "te 

0.8 





Figure 21.5. The likelihood 
function for the parameters of a 
Gaussian distribution. 
Surface plot and contour plot of 
the log likelihood as a function of 
(j, and a. The data set of N = 5 
points had mean x =1.0 and 
S 2 = ]T(.x--.i) 2 = 1.0. 
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Figure 21.6. Enumeration of the 
entire (discretized) hypothesis 
space for a mixture of two 
Gaussians. Weight of the mixture 
components is 7Ti, 7T2 = 0.6, 0.4 in 
the top half and 0.8, 0.2 in the 
bottom half. Means jii and ^2 
vary horizontally, and standard 
deviations G\ and a 2 vary 
vertically. 



coefficients 7Ti and TT2, satisfying tt\ + 7T2 = 1, 7r^ > 0. 

DA \ 71-1 /" (x-m) 2 \ , n 2 ( (x-H2) 2 \ 

p(*Imi, 01, in, = ^ exp (-H^rJ + 7i^ ex p l-S^-J 

Let's enumerate the subhypotheses for this alternative model. The parameter 
space is five-dimensional, so it becomes challenging to represent it on a single 
page. Figure 21.6 enumerates 800 subhypotheses with different values of the 
five parameters H2, &i, ""l- The means are varied between five values 
each in the horizontal directions. The standard deviations take on four values 
each vertically. And ir\ takes on two values vertically. We can represent the 
inference about these five parameters in the light of the five datapoints as 
shown in figure 21.7. 

If we wish to compare the one-Gaussian model with the mixture-of-two 
model, we can find the models' posterior probabilities by evaluating the 
marginal likelihood or evidence for each model H, P{{x} \ H). The evidence 
is given by integrating over the parameters, 0; the integration can be imple- 
mented numerically by summing over the alternative enumerated values of 

0, 

P({x}\H) = Y / P ( d ) P (M\0,n) 1 (21.9) 


where P{6) is the prior distribution over the grid of parameter values, which 
I take to be uniform. 

For the mixture of two Gaussians this integral is a five-dimensional integral; 
if it is to be performed at all accurately, the grid of points will need to be 
much finer than the grids shown in the figures. If the uncertainty about each 
of K parameters has been reduced by, say, a factor of ten by observing the 
data, then brute force integration requires a grid of at least 10 points. This 
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Figure 21.7. Inferring a mixture of 
two Gaussians. Likelihood 
function, given the data of 
figure 21.3, represented by line 
thickness. The hypothesis space is 
identical to that shown in 
figure 21.6. Sub hypotheses having 
likelihood smaller than e~ 8 times 
the maximum likelihood are not 
shown, hence the blank regions, 
which correspond to hypotheses 
that the data have ruled out. 
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exponential growth of computation with model size is the reason why complete 
enumeration is rarely a feasible computational strategy. 

Exercise 21.3. ^ Imagine fitting a mixture of ten Gaussians to data in a 
twenty-dimensional space. Estimate the computational cost of imple- 
menting inferences for this model by enumeration of a grid of parameter 
values. 
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Maximum Likelihood and Clustering 



Rather than enumerate all hypotheses - which may be exponential in number 
- we can save a lot of time by homing in on one good hypothesis that fits 
the data well. This is the philosophy behind the maximum likelihood method, 
which identifies the setting of the parameter vector that maximizes the 
likelihood, P(Data|0,W). 

For some models the maximum likelihood parameters can be identified 
instantly from the data; for more complex models, finding the maximum like- 
lihood parameters may require an iterative algorithm. 

For any model, it is usually easiest to work with the logarithm of the 
likelihood rather than the likelihood, since likelihoods, being products of the 
probabilities of many data points, tend to be very small. Likelihoods multiply; 
log likelihoods add. 

► 22.1 Maximum likelihood for one Gaussian 

We return to the Gaussian for our first examples. Assume we have data 
{x n }n=i- The log likelihood is: 

lnP({x n }% =1 \fi,o-) = -Nln(V2^o-)-J2(xn-v) 2 /(2o- 2 ). (22.1) 

n 

The likelihood can be expressed in terms of two functions of the data, the 
sample mean 

N 

x=Y,^n/N, (22.2) 

n=l 

and the sum of square deviations 

S* = ^(x„-S) 2 : (22.3) 

n 

lnP({:c„}^ =1 \n,&) = -7Vln(\/2W) - [N(ji - x) 2 + S}/(2a 2 ). (22.4) 

Because the likelihood depends on the data only through x and S, these two 
quantities are known as sufficient statistics. 

Example 22.1. Differentiate the log likelihood with respect to fi and show that, 
if the standard deviation is known to be cr, the maximum likelihood mean 
\x of a Gaussian is equal to the sample mean x, for any value of cr. 



Solution. 



d N(u-x) 

— InP = 22 - 5 

= when fi = x. □ (22.6) 
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(b) 





(c) 




Figure 22.1. The likelihood 
function for the parameters of a 
Gaussian distribution, 
(al, a2) Surface plot and contour 
plot of the log likelihood as a 
function of ji and cr. The data set 
of N = 5 points had mean x = 1.0 
and S 2 = - = 1.0. 

(b) The posterior probability of fi 
for various values of a. 

(c) The posterior probability of a 
for various fixed values of fi 
(shown as a density over lnc/). 



0.4 0.6 0.8 1 1.2 1.4 1.61.8 2 



If we Taylor- expand the log likelihood about the maximum, we can de- 
fine approximate error bars on the maximum likelihood parameter: we use 
a quadratic approximation to estimate how far from the maximum-likelihood 
parameter setting we can go before the likelihood falls by some standard fac- 
tor, for example e 1 / 2 , or e 4 / 2 . In the special case of a likelihood that is a 
Gaussian function of the parameters, the quadratic approximation is exact. 

Example 22.2. Find the second derivative of the log likelihood with respect to 
H, and find the error bars on /i, given the data and a. 



Solution. 



lnP=-4 



□ 



(22.7) 



Comparing this curvature with the curvature of the log of a Gaussian distri- 
bution over fi of standard deviation cr^, exp(— // 2 /(2cr 2 )), which is 1/cr 2 , we 
can deduce that the error bars on fj, (derived from the likelihood function) are 



(22.; 



The error bars have this property: at the two points fi = xzta^, the likelihood 
is smaller than its maximum value by a factor of e 1 ! 2 . 

Example 22.3. Find the maximum likelihood standard deviation cr of a Gaus- 
sian, whose mean is known to be /!, in the light of data {x n }^ =1 . Find 
the second derivative of the log likelihood with respect to lncr, and error 
bars on In (j. 

Solution. The likelihood's dependence on a is 



lnP({xX = i a) = -Nla(V2na) 



(22.9) 



where S to t = X^n( x « ~ I 1 ) 2 ■ To find the maximum of the likelihood, we can 
differentiate with respect to In cr. [It's often most hygienic to differentiate with 
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respect to \nu rather than u, when u is a scale variable; we use dM n /d(ln u) = 



nu .\ 

I TV 



cHnu 

This derivative is zero when 



dlnP({x„}% =1 \fi,,a) _ N | 5tot (22 10) 



N ' 

i.e., 



a 2 = %, (22.11) 



(T 



E»=i(^»- ») 2 
N 



(22.12) 



The second derivative is 

d 2 lnP({x n }% =1 \»,o) _ S 



= -2^21, (22.13) 



d(lna) 2 a 

and at the maximum- likelihood value of cr 2 , this equals —2N. So error bars 
on In cr are 

aina = 7!F D (22J4) 

> Exercise 22.4.^ Show that the values of « and In cr that jointly maximize the 
likelihood are: {^t,cr} ML = |al, <7jv = %J S/N^ , where 



<^y E " =l( *" ^ . (22.15) 

► 22.2 Maximum likelihood for a mixture of Gaussians 

We now derive an algorithm for fitting a mixture of Gaussians to one- 
dimensional data. In fact, this algorithm is so important to understand that, 
you, gentle reader, get to derive the algorithm. Please work through the fol- 
lowing exercise. 

ft exercise 

22.5. [2, p.310] A random variable x is assumed to have a probability 
distribution that is a mixture of two Gaussians, 



P(x\ Mi,M2,c) 



El / (x-fj, k y 
Pk , ^- exp —2 — 
\/2^2 V 2cr 2 



(22.16) 



where the two Gaussians are given the labels k = 1 and k = 2; the prior 
probability of the class label k is {pi = l/2, p2 = l/2}; {fJ-k} ar e the means 
of the two Gaussians; and both have standard deviation a. For brevity, we 
denote these parameters by 6 = {{/i^}, cr}. 

A data set consists of N points {a;„}„ =1 which are assumed to be indepen- 
dent samples from this distribution. Let k n denote the unknown class label of 
the nth point. 

Assuming that {fJ-k} and cr are known, show that the posterior probability 
of the class label k n of the nth point can be written as 

P(k n = l\x n ,d) 



1 + exp[-(wix n + w )] 

1 (22.17) 

P{k n = 2 | X n , 0) = - ■ r— — ; TT, 

1 + exp[+(wix„ + w )\ 
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and give expressions for w\ and wr> 

Assume now that the means {fJ.k} are not known, and that we wish to 
infer them from the data {x n }^ =1 . (The standard deviation cr is known.) In 
the remainder of this question we will derive an iterative algorithm for finding 
values for {/ifc} that maximize the likelihood, 

p({*n)Li i w, <*) = n P ( X " I 0*}> ( 22 - is ) 

n 

Let L denote the natural log of the likelihood. Show that the derivative of the 
log likelihood with respect to fik is given by 

ju=x>,„m, (22i9) 

where Pk\ n = P(k n = k \ x n , 0) appeared above at equation (22.17). 

Show, neglecting terms in -gj^;P{kn = k I x n ,6), that the second derivative 
is approximately given by 

Tr^L=-Y J Pk\n—- (22.20) 

m v 

Hence show that from an initial state fii,fi2, an approximate Newton-Raphson 
step updates these parameters to /it'i,^, where 

f/ k = ^ Pk ^ Xn . (22.21) 

2— in Pk\n 

The Newton-Raphson method for maximizing L(fi) updates \i to /J,' = \i — 

" dL I d 2 L] 
dn j du 2 



Assuming that a = 1, sketch a contour plot of the likelihood function as a 
function of fi\ and /12 for the data set shown above. The data set consists of 
32 points. Describe the peaks in your sketch and indicate their widths. 

Notice that the algorithm you have derived for maximizing the likelihood 
is identical to the soft K-means algorithm of section 20.4. Now that it is clear 
that clustering can be viewed as mixture-density-modelling, we are able to 
derive enhancements to the K-means algorithm, which rectify the problems 
we noted earlier. 



► 22.3 Enhancements to soft K-means 

Algorithm 22.2 shows a version of the soft-K-means algorithm corresponding 
to a modelling assumption that each cluster is a spherical Gaussian having its 
own width (each cluster has its own [3^ =1/ ofy. The algorithm updates the 
lengthscales Ofc for itself. The algorithm also includes cluster weight parame- 
ters 7Ti , 7T2 , . . . , ttk which also update themselves, allowing accurate modelling 
of data from clusters of unequal weights. This algorithm is demonstrated in 
figure 22.3 for two data sets that we've seen before. The second example shows 
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Assignment step. The responsibilities are 



» 



^(7lfeF ex p(-^( m(fe) ' x(ri) )) 



■d(m( fe '), X W) 



(22.22) 



where / is the dimensionality of x. 

Update step. Each cluster's parameters, m^ k \ 7Tfc, and <t 2 , are adjusted 
to match the data points that it is responsible for. 



E 



(«) (n) 



rl 'x 



m 



(fe) _ _n_ 



4 



R( k ) 

^r<"'(x(")-m«) 2 

n 

ZrW 

TTfe 



where i?( fc ' is the total responsibility of mean k, 



(22.23) 

(22.24) 
(22.25) 

(22.26) 



Algorithm 22.2. The soft K-mcans 
algorithm, version 2. 



t = 



t = 




i = 1 



£ = 2 



i = 1 




t = 10 



t = 3 



t = 20 



(3 








o 


















++^- — ■ ^ 


++ 


++ 



t = 9 Figure 22.3. Soft K-mcans 

algorithm, with K = 2, applied 
(a) to the 40-point data set of 
figure 20.3; (b) to the little 'n' 
large data set of figure 20.5. 



t = 30 t = 35 




» 



n-=i 



(/,:) 



exp 



E 

t=i 



(mf)-^) 2 /2 ( af)) 2 



J2 k , (numerator, with k! in place of k) 
5>W(*W-m«) a 



(22.27) 



(22.28) 



Algorithm 22.4. The soft K-means 
algorithm, version 3, which 
corresponds to a model of 
axis-aligned Gaussians. 
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Figure 22.5. Soft K-means 
algorithm, version 3, applied to 
the data consisting of two 
cigar-shaped clusters. K = 2 (cf. 
figure 20.6). 



t = t = 10 t = 20 t = 26 t = 32 Figure 22.6. Soft K-means 




algorithm, version 3, applied to 
the little 'n' large data set. K = 2. 



that convergence can take a long time, but eventually the algorithm identifies 
the small cluster and the large cluster. 

Soft K-means, version 2, is a maximum-likelihood algorithm for fitting a 
mixture of spherical Gaussians to data - 'spherical' meaning that the variance A proof that the algorithm does 
of the Gaussian is the same in all directions. This algorithm is still no good indeed maximize the likelihood is 
at modelling the cigar-shaped clusters of figure 20.6. If we wish to model the deferred to section 33.7. 
clusters by axis-aligned Gaussians with possibly-unequal variances, we replace 
the assignment rule (22.22) and the variance update rule (22.24) by the rules 
(22.27) and (22.28) displayed in algorithm 22.4. 

This third version of soft K-means is demonstrated in figure 22.5 on the 
'two cigars' data set of figure 20.6. After 30 iterations, the algorithm correctly 
locates the two clusters. Figure 22.6 shows the same algorithm applied to the 
little 'n' large data set; again, the correct cluster locations are found. 



► 22.4 A fatal flaw of maximum likelihood 

Finally, figure 22.7 sounds a cautionary note: when we fit K = 4 means to our 
first toy data set, we sometimes find that very small clusters form, covering 
just one or two data points. This is a pathological property of soft K-means 
clustering, versions 2 and 3. 

> Exercise 22.6. ^ Investigate what happens if one mean m^* 1 sits exactly on 
top of one data point; show that if the variance a\ is sufficiently small, 
then no return is possible: a? becomes ever smaller. 



t = 



t = 5 



t = 10 



t = 20 






ft 



Figure 22.7. Soft K-means 
algorithm applied to a data set of 
40 points. K = 4. Notice that at 
convergence, one very small 
cluster has formed between two 
data points. 
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KABOOM! 

Soft K-means can blow up. Put one cluster exactly on one data point and let its 
variance go to zero - you can obtain an arbitrarily large likelihood! Maximum 
likelihood methods can break down by finding highly tuned models that fit part 
of the data perfectly. This phenomenon is known as overfitting. The reason 
we are not interested in these solutions with enormous likelihood is this: sure, 
these parameter-settings may have enormous posterior probability density, 
but the density is large over only a very small volume of parameter space. So 
the probability mass associated with these likelihood spikes is usually tiny. 

We conclude that maximum likelihood methods are not a satisfactory gen- 
eral solution to data-modelling problems: the likelihood may be infinitely large 
at certain parameter settings. Even if the likelihood does not have infinitely- 
large spikes, the maximum of the likelihood is often unrepresentative, in high- 
dimensional problems. 

Even in low-dimensional problems, maximum likelihood solutions can be 
unrepresentative. As you may know from basic statistics, the maximum like- 
lihood estimator (22.15) for a Gaussian's standard deviation, a N , is a biased 
estimator, a topic that we'll take up in Chapter 24. 



The maximum a posteriori (MAP) method 

A popular replacement for maximizing the likelihood is maximizing the 
Bayesian posterior probability density of the parameters instead. However, 
multiplying the likelihood by a prior and maximizing the posterior does 
not make the above problems go away; the posterior density often also has 
infinitely-large spikes, and the maximum of the posterior probability density 
is often unrepresentative of the whole posterior distribution. Think back to 
the concept of typicality, which we encountered in Chapter 4: in high dimen- 
sions, most of the probability mass is in a typical set whose properties are 
quite different from the points that have the maximum probability density. 
Maxima are atypical. 

A further reason for disliking the maximum a posteriori is that it is basis- 
dependent. If we make a nonlinear change of basis from the parameter 6 to 
the parameter u = f{6) then the probability density of 9 is transformed to 

P(u)=P{0) 



du 



(22.29) 



The maximum of the density P{u) will usually not coincide with the maximum 
of the density P(0). (For figures illustrating such nonlinear changes of basis, 
see the next chapter.) It seems undesirable to use a method whose answers 
change when we change representation. 



Further reading 

The soft K-means algorithm is at the heart of the automatic classification 
package, AutoClass (Hanson et al, 1991b; Hanson et al, 1991a). 



► 22.5 Further exercises 

Exercises where maximum likelihood may be useful 

Exercise 22.7. [3] Make a version of the K-means algorithm that models the 
data as a mixture of K arbitrary Gaussians, i.e., Gaussians that are not 
constrained to be axis-aligned. 
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> Exercise 22.8. ^ (a) A photon counter is pointed at a remote star for one 
minute, in order to infer the brightness, i.e., the rate of photons 
arriving at the counter per minute, A. Assuming the number of 
photons collected r has a Poisson distribution with mean A, 



P(r\X) =exp(-A) 



A r 



(22.30) 



what is the maximum likelihood estimate for A, given r = 9? Find 
error bars on In A. 

(b) Same situation, but now we assume that the counter detects not 
only photons from the star but also 'background' photons. The 
background rate of photons is known to be 6=13 photons per 
minute. We assume the number of photons collected, r, has a Pois- 
son distribution with mean \+b. Now, given r = 9 detected photons, 
what is the maximum likelihood estimate for A? Comment on this 
answer, discussing also the Bayesian posterior distribution, and the 
'unbiased estimator' of sampling theory, A = r — b. 

bxercise 22.9. [2] A bent coin is tossed N times, giving N a heads and Nf, tails. 
Assume a beta distribution prior for the probability of heads, p, for 
example the uniform distribution. Find the maximum likelihood and 
maximum a posteriori values of p, then find the maximum likelihood 
and maximum a posteriori values of the logit a = ln[p/(l— p)]. Compare 
with the predictive distribution, i.e., the probability that the next toss 
will come up heads. 

> Exercise 22.10. [2] Two men looked through prison bars; one saw stars, the 

other tried to infer where the window frame was. 

From the other side of a room, you look through a window and see stars 
at locations {(x n , y n )}. You can't see the window edges because it is to- 
tally dark apart from the stars. Assuming the window is rectangular and 
that the visible stars' locations are independently randomly distributed, 
what are the inferred values of (x m i n , y m i n , £max, y m ax), according to 
maximum likelihood? Sketch the likelihood as a function of x max , for 
fixed £ m i n , y m i n , and y max - 

> Exercise 22.11. ^ A sailor infers his location (x,y) by measuring the bearings 

of three buoys whose locations (x n ,y n ) are given on his chart. Let the 
true bearings of the buoys be 6 n . Assuming that his measurement 9 n of 
each bearing is subject to Gaussian noise of small standard deviation cr, 
what is his inferred location, by maximum likelihood? 

The sailor's rule of thumb says that the boat's position can be taken to 
be the centre of the cocked hat, the triangle produced by the intersection 
of the three measured bearings (figure 22.8). Can you persuade him that 
the maximum likelihood answer is better? 




(Xl 




[> Exercise 22.12. [5 ' p - 310] Maximum likelihood fitting of an exponential-family Figure 22.8. The standard way of 
moc j I drawing three slightly inconsistent 

bearings on a chart produces a 
Assume that a variable x comes from a probability distribution of the triangle called a cocked hat. 
form 



Where is the sailor? 



P ( x l w ) = Zjwj 6XP (^2 Wk ^^j ' 



(22.31) 
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where the functions /fc(x) are given, and the parameters w = {wfc} are 
not known. A data set {x^} of N points is supplied. 

Show by differentiating the log likelihood that the maximum-likelihood 
parameters w ML satisfy 

x n 

where the left-hand sum is over all x, and the right-hand sum is over the 
data points. A shorthand for this result is that each function-average 
under the fitted model must equal the function- average found in the 
data: 

(/fc)p(x|w ML ) = (A)Data ■ (22.33) 

> Exercise 22.13. ^ 'Maximum entropy' fitting of models to constraints. 

When confronted by a probability distribution P(x) about which only a 
few facts are known, the maximum entropy principle (maxent) offers a 
rule for choosing a distribution that satisfies those constraints. Accord- 
ing to maxent, you should select the -P(x) that maximizes the entropy 

ff = £)P(x)logl/P(x), (22.34) 

X 

subject to the constraints. Assuming the constraints assert that the 
averages of certain functions /fc(x) are known, i.e., 

(A)p(x) = Fh, (22.35) 

show, by introducing Lagrange multipliers (one for each constraint, in- 
cluding normalization), that the maximum-entropy distribution has the 
form 



-P(x)lVIaxent — 




(22.36) 



where the parameters Z and {w^} are set such that the constraints 
(22.35) are satisfied. 

And hence the maximum entropy method gives identical results to max- 
imum likelihood fitting of an exponential-family model (previous exer- 
cise) . 

The maximum entropy method has sometimes been recommended as a 
method for assigning prior distributions in Bayesian modelling. While 
the outcomes of the maximum entropy method are sometimes interesting 
and thought-provoking, I do not advocate maxent as the approach to 
assigning priors. 

Maximum entropy is also sometimes proposed as a method for solv- 
ing inference problems - for example, 'given that the mean score of 
this unfair six-sided die is 2.5, what is its probability distribution 
(pi,P2,P3,P4,P5,P6)?' I think it is a bad idea to use maximum entropy 
in this way; it can give silly answers. The correct way to solve inference 
problems is to use Bayes' theorem. 
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Exercises where maximum likelihood and MAP have difficulties 

> Exercise 22.14. ^ This exercise explores the idea that maximizing a proba- 

bility density is a poor way to find a point that is representative of 
the density. Consider a Gaussian distribution in a fc-dimensional space, 
P(w) = {l/V2^cr w ) k exp{-Y l k 1 wf/2a'i). Show that nearly all of the 
probability mass of a Gaussian is in a thin shell of radius r = Vka w 
and of thickness proportional to r/\fk. For example, in 1000 dimen- 
sions, 90% of the mass of a Gaussian with a w = 1 is in a shell of radius 
31.6 and thickness 2.8. However, the probability density at the origin is 
e fc/2 ^ io 217 times bigger than the density at this shell where most of 
the probability mass is. 

Now consider two Gaussian densities in 1000 dimensions that differ in 
radius a w by just 1%, and that contain equal total probability mass. 
Show that the maximum probability density is greater at the centre of 
the Gaussian with smaller a w by a factor of ~exp(0.01fc) ~ 20000. 

In ill-posed problems, a typical posterior distribution is often a weighted 
superposition of Gaussians with varying means and standard deviations, 
so the true posterior has a skew peak, with the maximum of the prob- 
ability density located near the mean of the Gaussian distribution that 
has the smallest standard deviation, not the Gaussian with the greatest 
weight. 

> Exercise 22. 15 J 5 ' The seven scientists. N datapoints {x n } are drawn from 

N distributions, all of which are Gaussian with a common mean \i but 
with different unknown standard deviations a n . What are the maximum 
likelihood parameters fi, {cr n } given the data? For example, seven 
scientists (A, B, C, D, E, F, G) with wildly-differing experimental skills 
measure fi. You expect some of them to do accurate work (i.e., to have 
small <r n ), and some of them to turn in wildly inaccurate answers (i.e., 
to have enormous a n ). Figure 22.9 shows their seven results. What is 
u, and how reliable is each scientist? 

I hope you agree that, intuitively, it looks pretty certain that A and B 
are both inept measurers, that D-G are better, and that the true value 
of u is somewhere close to 10. But what does maximizing the likelihood 
tell you? 

Exercise 22.16. ^ Problems with MAP method. A collection of widgets i = 
l,...,k have a property called 'wodge', Wj, which we measure, wid- 
get by widget, in noisy experiments with a known noise level a v = 1.0. 
Our model for these quantities is that they come from a Gaussian prior 
P(wi | a) = Normal(0, Va), where ct = 1/(7%, is not known. Our prior for 
this variance is flat over logov from a w = 0.1 to a w = 10. 

Scenario 1. Suppose four widgets have been measured and give the fol- 
lowing data: {^1,^2,^3,^4} = {2.2, —2.2, 2.8, —2.8}. We are interested 
in inferring the wodges of these four widgets. 

(a) Find the values of w and a that maximize the posterior probability 
P(w, log a I d). 

(b) Marginalize over a and find the posterior probability density of w 
given the data. [Integration skills required. See MacKay (1999a) for 
solution.] Find maxima of P(w | d). [Answer: two maxima - one at 
w mp = {1-8, —1.8, 2.2, —2.2}, with error bars on all four parameters 



B CD-G 



-20 -10 


10 


Scientist 




A 


-27.020 


B 


3.570 


C 


8.191 


D 


9.898 


E 


9.603 


F 


9.945 


G 


10.056 



Figure 22.9. Seven measurements 
{x n } of a parameter fj, by seven 
scientists each having his own 
noise- level a n . 
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(obtained from Gaussian approximation to the posterior) ±0.9; and 
one at w' MP = {0.03, -0.03, 0.04, -0.04} with error bars ±0.1.] 

Scenario 2. Suppose in addition to the four measurements above we are 
now informed that there are four more widgets that have been measured 
with a much less accurate instrument, having o' v = 100.0. Thus we now 
have both well-determined and ill-determined parameters, as in a typical 
ill-posed problem. The data from these measurements were a string of 
uninformative values, {^5,^6,^7,^8} = {100, —100, 100, —100}. 

We are again asked to infer the wodges of the widgets. Intuitively, our 
inferences about the well-measured widgets should be negligibly affected 
by this vacuous information about the poorly-measured widgets. But 
what happens to the MAP method? 

(a) Find the values of w and a that maximize the posterior probability 
P(w,loga I d). 

(b) Find maxima of P(w|d). [Answer: only one ni ax 1 mum , w yp — 
{0.03, -0.03, 0.03, -0.03, 0.0001, -0.0001, 0.0001, -0.0001}, with 
error bars on all eight parameters ±0.11.] 

► 22.6 Solutions 

Solution to exercise 22.5 (p. 302). Figure 22.10 shows a contour plot of the 
likelihood function for the 32 data points. The peaks are pretty-near centred 
on the points (1,5) and (5, 1), and are pretty-near circular in their contours. 
The width of each of the peaks is a standard deviation of er/\/l6 =1/4. The 
peaks are roughly Gaussian in shape. 

Solution to exercise 22.12 (p. 307). The log likelihood is: 

lnP({x<">} I w) = -JVlnZ(w) + (* (n) )- (22.37) 



Figure 22.10. The likelihood as a 
function of /ii and /i 2 . 



dwi 



■lnP({x(")}|w) 



-TV 



dw k 



lnZ(w)+£/ fc (x). 



(22.38) 



Now, the fun part is what happens when we differentiate the log of the nor- 
malizing constant: 



so 



1 x - d 



^Wfc'/fc'(x) 



(22.39) 



^- lnP({xW} I w) = -TV £ P(x I w)/ fc (x) + E /*(x), (22.40) 

k x n 

and at the maximum of the likelihood, 

5>(x|w ML )A(x) = ^E^( x(n) )- ( 22 - 41 ) 
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Useful Probability Distributions 



In Bayesian data modelling, there's a small collection of probability distribu- 
tions that come up again and again. The purpose of this chapter is to intro- 
duce these distributions so that they won't be intimidating when encountered 
in combat situations. 

There is no need to memorize any of them, except perhaps the Gaussian; 
if a distribution is important enough, it will memorize itself, and otherwise, it 
can easily be looked up. 

► 23.1 Distributions over integers 

Binomial, Poisson, exponential 

We already encountered the binomial distribution and the Poisson distribution 
on page 2. 

The binomial distribution for an integer r with parameters / (the bias, 
/ e [0, 1]) and N (the number of trials) is: 



P(r\f,N) 



N 



f(W) 



N-r 



re {0,1,2,..., N}. 



(23.1) 



The binomial distribution arises, for example, when we flip a bent coin, 
with bias /, N times, and observe the number of heads, r. 



The Poisson distribution with parameter A > is: 

re {0,1,2,...}. 



P(r\\) = e- xXT 



(23.2) 



The Poisson distribution arises, for example, when we count the number of 
photons r that arrive in a pixel during a fixed interval, given that the mean 
intensity on the pixel corresponds to an average number of photons A. 



The exponential distribution on integers,, 

P(r\f) = f r (l- f) re(0,l,2 oo), 



(23.3) 



arises in waiting problems. How long will you have to wait until a six is rolled, 
if a fair six-sided dice is rolled? Answer: the probability distribution of the 
number of rolls, r, is exponential over integers with parameter / = 5/6. The 
distribution may also be written 



P(r\f) = (l-f)e- Xr re (0,1,2,..., oo), 
where A = ln(l//). 



(23.4) 




1 

0.1 
0.01 
0.001 -■ 
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Figure 23.1. The binomial 
distribution P(r\f = 0.3, N = 
on a linear scale (top) and a 
logarithmic scale (bottom). 
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Figure 23.2. The Poisson 
distribution P(r\ A = 2.7), on 
linear scale (top) and a 
logarithmic scale (bottom). 
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► 23.2 Distributions over unbounded real numbers 

Gaussian, Student, Cauchy, biexponential, inverse-cosh. 

The Gaussian distribution or normal distribution with mean fj, and standard 
deviation a is 



P(x | n, a) = — exp 



2a 2 



x e (— oo, oo), 



(23.5) 



where 



V2~7 



(23.6) 

It is sometimes useful to work with the quantity r = 1 /a 2 , which is called the 
precision parameter of the Gaussian. 

A sample z from a standard univariate Gaussian can be generated by 
computing 

z = cos{2iru 1 ) v / 2 ln(l/u 2 ), (23.7) 

where u\ and U2 are uniformly distributed in (0,1). A second sample Z2 = 
sin(27rui ) y/2 ln( T/vq ) , independent of the first, can then be obtained for free. 

The Gaussian distribution is widely used and often asserted to be a very 
common distribution in the real world, but I am sceptical about this asser- 
tion. Yes, unimodal distributions may be common; but a Gaussian is a spe- 
cial, rather extreme, unimodal distribution. It has very light tails: the log- 
probability-density decreases quadratically. The typical deviation of x from fj, 
is a, but the respective probabilities that x deviates from \x by more than 2a, 
3er, 4(7, and 5cr, are 0.046, 0.003, 6 x 10~ 5 , and 6 x 10~ 7 . In my experience, 
deviations from a mean four or five times greater than the typical deviation 
may be rare, but not as rare as 6 x 10~ 5 ! I therefore urge caution in the use of 
Gaussian distributions: if a variable that is modelled with a Gaussian actually 
has a heavier-tailed distribution, the rest of the model will contort itself to 
reduce the deviations of the outliers, like a sheet of paper being crushed by a 
rubber band. 

> Exercise 23. 1. ^ ^ Pick a variable that is supposedly bell-shaped in probability 
distribution, gather data, and make a plot of the variable's empirical 
distribution. Show the distribution as a histogram on a log scale and 
investigate whether the tails are well-modelled by a Gaussian distribu- 
tion. [One example of a variable to study is the amplitude of an audio 
signal.] 

One distribution with heavier tails than a Gaussian is a mixture of Gaus- 
sians. A mixture of two Gaussians, for example, is defined by two means, 
two standard deviations, and two mixing coefficients tt\ and 7T 2 , satisfying 

7Ti + 7T 2 = 1, TTj > 0. 



P(X I Hi, (71, 71-1,^2,0-2,^2) 



7T1 



27r(7l 



exp 



7T 2 



27TO-2 



exp 



V 2a§ J' 



If we take an appropriately weighted mixture of an infinite number of 
Gaussians, all having mean \i, we obtain a Student-t distribution, 



P(x I /1, s, n) 



1 



1 



Z(l + (x- /i) 2 /(ns 2 ))(™+ 1 )/ 2 ' 



where 



Tins' 



T(n/2) 



r((n+l)/2) 



(23.8) 



(23.9) 




Figure 23.3. Three unimodal 
distributions. Two Student 
distributions, with parameters 
(rn, s) = (1, 1) (heavy line) (a 
Cauchy distribution) and (2,4) 
(light line), and a Gaussian 
distribution with mean fi = 3 and 
standard deviation a = 3 (dashed 
line), shown on linear vertical 
scales (top) and logarithmic 
vertical scales (bottom) . Notice 
that the heavy tails of the Cauchy 
distribution arc scarcely evident 
in the upper 'bell-shaped curve'. 
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and n is called the number of degrees of freedom and T is the gamma function. 
If n > 1 then the Student distribution (23.8) has a mean and that mean is 
u,. If n > 2 the distribution also has a finite variance, cr 2 = ns 2 /(n — 2). 
As n — > oo, the Student distribution approaches the normal distribution with 
mean fi and standard deviation s. The Student distribution arises both in 
classical statistics (as the sampling-theoretic distribution of certain statistics) 
and in Bayesian inference (as the probability distribution of a variable coming 
from a Gaussian distribution whose standard deviation we aren't sure of). 

In the special case n = 1, the Student distribution is called the Cauchy 
distribution. 

A distribution whose tails are intermediate in heaviness between Student 
and Gaussian is the biexponential distribution, 

P(x \n,s) = ^ exp ^- ^ ~ f^ j x e (-00,00) (23.10) 

where 

Z=2s. (23.11) 



The inverse-cosh distribution 

P(x\/3) oc- j— — -TTja (23.12) 

V 1 ' [cosh(/3a;)] 1 //3 v ' 

is a popular model in independent component analysis. In the limit of large (3, 
the probability distribution P(x \ (3) becomes a biexponential distribution. In 
the limit (3 — > P(x \ (3) approaches a Gaussian with mean zero and variance 
1/(3. 



► 23.3 Distributions over positive real numbers 

Exponential, gamma, inverse-gamma, and log-normal. 

The exponential distribution, 

P(x\s) = -exp(--) x e (0,oo), (23.13) 

where 

Z = s, (23.14) 

arises in waiting problems. How long will you have to wait for a bus in Pois- 
sonville, given that buses arrive independently at random with one every s 
minutes on average? Answer: the probability distribution of your wait, x, is 
exponential with mean s. 

The gamma distribution is like a Gaussian distribution, except whereas the 
Gaussian goes from —00 to 00, gamma distributions go from to 00. Just as 
the Gaussian distribution has two parameters fi and a which control the mean 
and width of the distribution, the gamma distribution has two parameters. It 
is the product of the one-parameter exponential distribution (23.13) with a 
polynomial, x c ~ l . The exponent c in the polynomial is the second parameter. 

P(x\s,c) = T{x;s,c) = tf(^) C ^xp (-^ ) , < x < 00 (23.15) 



where 



Z = r(c)s. 



(23.16) 
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0.0001 




Figure 23.4. Two gamma 
distributions, with parameters 
(s, c) = (1, 3) (heavy lines) and 
10,0.3 (light lines), shown on 
linear vertical scales (top) and 
logarithmic vertical scales 
(bottom); and shown as a 
function of x on the left (23.15) 
and I = lnx on the right (23.18). 



In a; 



This is a simple peaked distribution with mean sc and variance s 2 c. 

It is often natural to represent a positive real variable x in terms of its 
logarithm I = In x. The probability density of I is 



P(l) = P(x(l)) 



dx 

In 



P(x(l))x(l) (23.17) 



1 fx(l)\ c ( x(l)\ 

' exp K -L , (23.18) 



Zi\ s I \ s 



where 



T(c). (23.19) 



[The gamma distribution is named after its normalizing constant - an odd 
convention, it seems to me!] 

Figure 23.4 shows a couple of gamma distributions as a function of x and 
of I. Notice that where the original gamma distribution (23.15) may have a 
'spike' at x = 0, the distribution over I never has such a spike. The spike is 
an artefact of a bad choice of basis. 

In the limit sc = l,c — ► 0, we obtain the noninformative prior for a scale 
parameter, the 1/x prior. This improper prior is called noninformative because 
it has no associated length scale, no characteristic value of x, so it prefers all 
values of x equally. It is invariant under the reparameterization x = mx. If 
we transform the 1 jx probability density into a density over I = In x we find 
the latter density is uniform. 

> Exercise 23.2.^ Imagine that we reparameterize a positive variable x in terms 
of its cube root, u = x 1 ^. If the probability density of x is the improper 
distribution 1/x, what is the probability density of u? 

The gamma distribution is always a unimodal density over I = lnx, and, 
as can be seen in the figures, it is asymmetric. If x has a gamma distribution, 
and we decide to work in terms of the inverse of x, v = 1/x, we obtain a new 
distribution, in which the density over I is flipped left-for-right: the probability 
density of v is called an inverse-gamma distribution, 

1 ( 1 \ c+1 / 1 



P(v I s , c) = — — exp , < v < oo (23.20) 

Z v \sv J \ svj 

where 

Z v =T(c)/s. (23.21) 
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0.2 

0.1 - 




0.1 
0.01 
0.001 
0.0001 








- 





Figure 23.5. Two inverse gamma 
distributions, with parameters 
(s, c) = (1, 3) (heavy lines) and 
10,0.3 (light lines), shown on 
linear vertical scales (top) and 
logarithmic vertical scales 
(bottom); and shown as a 
function of x on the left and 
I = In a; on the right. 



Inv 



Gamma and inverse gamma distributions crop up in many inference prob- 
lems in which a positive quantity is inferred from data. Examples include 
inferring the variance of Gaussian noise from some noise samples, and infer- 
ring the rate parameter of a Poisson distribution from the count. 

Gamma distributions also arise naturally in the distributions of waiting 
times between Poisson-distributed events. Given a Poisson process with rate 
A, the probability density of the arrival time x of the mth event is 



X(\x) 



m— 1 



— Xx 



(m-l)! 



(23.22) 



Log-normal distribution 



Another distribution over a positive real number x is the log-normal distribu- 
tion, which is the distribution that results when / = lnx has a normal distri- 
bution. We define m to be the median value of x, and s to be the standard 
deviation of In x . 



P(l \ m,s) = — exp 
Z 



(I — him)' 
2s~ 2 



I € (-co, oo), 



where 
implies 



P(x I m, s) = — exp 

x 



Z=V2ns 2 , 

(Inx — mm) 2 
2? 



x e (o, oo). 



(23.23) 
(23.24) 

(23.25) 




0.01 
0.001 - 
0.0001 



Figure 23.6. Two log-normal 
distributions, with parameters 
(rn, s) = (3, 1.8) (heavy line) and 
(3,0.7) (light line), shown on 
linear vertical scales (top) and 
logarithmic vertical scales 
(bottom). [Yes, they really do 
have the same value of the 
median, rn = 3.] 



► 23.4 Distributions over periodic variables 

A periodic variable 6 is a real number € [0, 2ir] having the property that 9=0 
and 8 = 2ir are equivalent. 

A distribution that plays for periodic variables the role played by the Gaus- 
sian distribution for real variables is the Von Mises distribution: 

P{6\n,(3) = -exp(/3cos(6»- / ti)) 0e(O,27r). (23.26) 

Zj 

The normalizing constant is Z = 27rJ (/3), where Iq{x) is a modified Bessel 
function. 
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A distribution that arises from Brownian diffusion around the circle is the 
wrapped Gaussian distribution, 

oo 

P(0\h,<t) = Normal(0; (fj, + 2im) ,<r) 6 E (0, 2tt) . (23.27) 



► 23.5 Distributions over probabilities 

Beta distribution, Dirichlet distribution, entropic distribution 

The beta distribution is a probability density over a variable p that is a prob- 
ability p £ (0, 1): 



P(p\ U\,U 2 ) 



Z(ui,u 2 



m— l 



(i-p) 



U2 — 1 



(23.28) 



The parameters ui,u 2 may take any positive value. The normalizing constant 
is the beta function, 

r(«i)r(«2) 



Z(ui,u 2 ) = 



(23.29) 



r(«i + u 2 ) 

Special cases include the uniform distribution - u\ = l,u 2 = l; the Jeffreys 
prior - Mi = 0.5, u 2 = 0.5; and the improper Laplace prior - mi = 0,«2 = 0. If 
we transform the beta distribution to the corresponding density over the logit 
I = In p/ (1 — p), we find it is always a pleasant bell-shaped density over I, while 
the density over p may have singularities at p = and p = 1 (figure 23.7). 

More dimensions 

The Dirichlet distribution is a density over an /-dimensional vector p whose 
/ components are positive and sum to 1. The beta distribution is a special 
case of a Dirichlet distribution with 1 = 2. The Dirichlet distribution is 
parameterized by a measure u (a vector with all coefficients Ui > 0) which 
I will write here as u = am, where m is a normalized measure over the / 
components (X^ m i = l)i an d c is positive: 





0.6 
0.5 
0.4 
0.3 
0.2 
0.1 




Figure 23.7. Three beta 
distributions, with 
(ui,w 2 ) = (0.3,1), (1.3,1), and 
(12,2). The upper figure shows 
P(p | in, U2) as a function of p; the 
lower shows the corresponding 
density over the logit, 

In- 



1 — p 



Notice how well-behaved the 
densities are as a function of the 
logit. 



P(p I am) 



1 



Z(am) 



fjp" mi_1 <5 (Y,iPi - 1) = Dirichlet (/) (p | am). (23.30) 



The function 5(x) is the Dirac delta function, which restricts the distribution 
to the simplex such that p is normalized, i.e., ^ ji Pi = 1. The normalizing 
constant of the Dirichlet distribution is: 



Z{am) = Y[T{ami) /T{a) . 



(23.31) 



The vector m is the mean of the probability distribution: 
Dirichlet (/) (p I am) p d p = m. 



(23.32) 



When working with a probability vector p, it is often helpful to work in the 
'softmax basis', in which, for example, a three-dimensional probability p = 
(pi ,p 2 ,pz) is represented by three numbers a\ , a 2 , as satisfying a\ + a 2 + as = 
and 

(23.33) 



Pi 



— e ai , where Z = J2i e<li ■ 

Zj 



This nonlinear transformation is analogous to the <r — > In <j transformation 
for a scale variable and the logit transformation for a single probability, p — > 
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u= (20, 10,7) u= (0.2, 1,2) u = (0.2,0.3,0.15) 



Figure 23.8. Three Dirichlet 
distributions over a 
three-dimensional probability 
vector (pi,p 2 ,P3). The upper 
figures show 1000 random draws 
from each distribution, showing 
the values of p± and P2 on the two 
axes. p3 = 1 — (pi +P2)- The 
triangle in the first figure is the 
simplex of legal probability 
distributions. 

The lower figures show the same 
points in the 'softmax' basis 
(equation (23.33)). The two axes 
show cii and a 2 . 03 = — a\ — a 2 . 



-4 4 



-4 4 



-4 4 



lriy^. In the softmax basis, the ugly minus-ones in the exponents in the 
Dirichlet distribution (23.30) disappear, and the density is given by: 



(23.34) 



The role of the parameter a can be characterized in two ways. First, a mea- 
sures the sharpness of the distribution (figure 23.8); it measures how different 
we expect typical samples p from the distribution to be from the mean m, just 
as the precision r = V<r 2 of a Gaussian measures how far samples stray from its 
mean. A large value of a produces a distribution over p that is sharply peaked 
around m. The effect of a in higher-dimensional situations can be visualized 
by drawing a typical sample from the distribution Dirichlet ^' (p | am) , with 
m set to the uniform vector mj = Vl, and making a Zipf plot, that is, a ranked 
plot of the values of the components p,. It is traditional to plot both pi (ver- 
tical axis) and the rank (horizontal axis) on logarithmic scales so that power 
law relationships appear as straight lines. Figure 23.9 shows these plots for a 
single sample from ensembles with / = 100 and / = 1000 and with a from 0.1 
to 1000. For large a, the plot is shallow with many components having simi- 
lar values. For small a, typically one component pi receives an overwhelming 
share of the probability, and of the small probability that remains to be shared 
among the other components, another component py receives a similarly large 
share. In the limit as a goes to zero, the plot tends to an increasingly steep 
power law. 

Second, we can characterize the role of a in terms of the predictive dis- 
tribution that results when we observe samples from p and obtain counts 
F = (Fi, F 2 , . . . , Fj) of the possible outcomes. The value of a defines the 
number of samples from p that are required in order that the data dominate 
over the prior in predictions. 

Exercise 23.3.^ The Dirichlet distribution satisfies a nice additivity property. 
Imagine that a biased six-sided die has two red faces and four blue faces. 
The die is rolled N times and two Bayesians examine the outcomes in 
order to infer the bias of the die and make predictions. One Bayesian 
has access to the red/blue colour outcomes only, and he infers a two- 
component probability vector (pr,Pb). The other Bayesian has access 
to each full outcome: he can see which of the six faces came up, and 
he infers a six-component probability vector (pi,P2,P3,P4,P5,Pe)j where 



100 



10 
100 
1000 



1000 



100 

1000 



Figure 23.9. Zipf plots for random 
samples from Dirichlet 
distributions with various values 
of a = 0.1 . . . 1000. For each value 
of I = 100 or 1000 and each a, 
one sample p from the Dirichlet 
distribution was generated. The 
Zipf plot shows the probabilities 
Pi, ranked by magnitude, versus 
their rank. 
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PR = Pi + pi and pb = P3 + Pa + P5 + P6- Assuming that the sec- 
ond Bayesian assigns a Dirichlet distribution to (pi,P2,P3,PA,P5,P6) with 
hyperparameters (u\, U2, U3, W4, U5, uq), show that, in order for the first 
Bayesian's inferences to be consistent with those of the second Bayesian, 
the first Bayesian's prior should be a Dirichlet distribution with hyper- 
parameters ((ui + U2), (us + W4 + «5 + Ug)). 

Hint: a brute-force approach is to compute the integral P(pr,Pb) = 
Jd 6 pP(p I u) S(p R - (pi + p 2 )) S(p B - (p 3 +P4+P5+ Pe))- A cheaper 
approach is to compute the predictive distributions, given arbitrary data 
(Fi ,F2,Fs, F4, F5, Fq), and find the condition for the two predictive dis- 
tributions to match for all data. 

The entropic distribution for a probability vector p is sometimes used in 
the 'maximum entropy' image reconstruction community. 

P(p|a,m) = 1 exp[-a£>KL(p||m)] S^Pi - 1) , (23.35) 

where m, the measure, is a positive vector, and DKL(p||m) = J2iPi^°SPi/ m i- 

Further reading 

See (MacKay and Peto, 1995) for fun with Dirichlets. 

► 23.6 Further exercises 

Exercise 23.4. ^ datapoints {x n } are drawn from a gamma distribution 
P(x I s, c) = r(a;; s, c) with unknown parameters s and c. What are the 
maximum likelihood parameters s and c? 
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24 



Exact Marginalization 



How can we avoid the exponentially large cost of complete enumeration of 
all hypotheses? Before we stoop to approximate methods, we explore two 
approaches to exact marginalization: first, marginalization over continuous 
variables (sometimes known as nuisance parameters) by doing integrals; and 
second, summation over discrete variables by message-passing. 

Exact marginalization over continuous parameters is a macho activity en- 
joyed by those who are fluent in definite integration. This chapter uses gamma 
distributions; as was explained in the previous chapter, gamma distributions 
are a lot like Gaussian distributions, except that whereas the Gaussian goes 
from — oo to oo, gamma distributions go from to oo. 

► 24.1 Inferring the mean and variance of a Gaussian distribution 

We discuss again the one-dimensional Gaussian distribution, parameterized 
by a mean \i and a standard deviation a: 



P(x \n,a) = -=- exp - l 2 W = Normal(x; u, a 2 ). (24.1) 



When inferring these parameters, we must specify their prior distribution. 
The prior gives us the opportunity to include specific knowledge that we have 
about [i and a (from independent experiments, or on theoretical grounds, for 
example). If we have no such knowledge, then we can construct an appropriate 
prior that embodies our supposed ignorance. In section 21.2, we assumed a 
uniform prior over the range of parameters plotted. If we wish to be able to 
perform exact marginalizations, it may be useful to consider conjugate priors; 
these are priors whose functional form combines naturally with the likelihood 
such that the inferences have a convenient form. 

Conjugate priors for /j, and a 

The conjugate prior for a mean fi is a Gaussian: we introduce two 'hy- 
perparameters', /io and cr^, which parameterize the prior on /!, and write 
P(u | /irj, o>) = Normal(/i; /io, cr^). In the limit /irj = 0, — > oo, we obtain 
the noninformative prior for a location parameter, the flat prior. This is 
noninformative because it is invariant under the natural reparameterization 
y! = u + c. The prior P(n) = const, is also an improper prior, that is, it is not 
normalizable. 

The conjugate prior for a standard deviation a is a gamma distribution, 
which has two parameters bp and cp. It is most convenient to define the prior 




1 
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density of the inverse variance (the precision parameter) (3 = 1/cr 2 : 



P{(3) = T{(3-b p ,cp) 



1 f_p_ 



< P < oo. (24.2) 



This is a simple peaked distribution with mean bpcp and variance b\cp. In 
the limit bpcp = l,cp — > 0, we obtain the noninformative prior for a scale 
parameter, the 1/cr prior. This is 'noninformative' because it is invariant 
under the reparameterization cr' = ca. The 1/cr prior is less strange- looking if 
we examine the resulting density over In a, or ln/3, which is flat. This is the 
prior that expresses ignorance about cr by saying 'well, it could be 10, or it 
could be 1, or it could be 0.1, . . . ' Scale variables such as cr are usually best 
represented in terms of their logarithm. Again, this noninformative 1/cr prior 
is improper. 

In the following examples, I will use the improper noninformative priors 
for /i and cr. Using improper priors is viewed as distasteful in some circles, 
so let me excuse myself by saying it's for the sake of readability; if I included 
proper priors, the calculations could still be done but the key points would be 
obscured by the flood of extra parameters. 

Maximum likelihood and marginalization: a N and a N _ x 

The task of inferring the mean and standard deviation of a Gaussian distribu- 
tion from N samples is a familiar one, though maybe not everyone understands 
the difference between the <j n and a N _! buttons on their calculator. Let us 
recap the formulae, then derive them. 

Given data D = {x n }^ =1 , an 'estimator' of /i is 



2—in= 



(24.3) 



Reminder: when we change 
variables from a to 1(a), a 
one-to-one function of cr, the 
probability density transforms 
from Pcr(u) to 



W) = pA<t) 

Here, the Jacobian is 
da 



dl 



din a 



and two estimators of a are: 



a N 



N 



Dn=l( a 



and 



xf 



N- 1 



(24.4) 



There are two principal paradigms for statistics: sampling theory and Bayesian 
inference. In sampling theory (also known as 'frequentist' or orthodox statis- 
tics), one invents estimators of quantities of interest and then chooses between 
those estimators using some criterion measuring their sampling properties; 
there is no clear principle for deciding which criterion to use to measure the 
performance of an estimator; nor, for most criteria, is there any systematic 
procedure for the construction of optimal estimators. In Bayesian inference, 
in contrast, once we have made explicit all our assumptions about the model 
and the data, our inferences are mechanical. Whatever question we wish to 
pose, the rules of probability theory give a unique answer which consistently 
takes into account all the given information. Human-designed estimators and 
confidence intervals have no role in Bayesian inference; human input only en- 
ters into the important tasks of designing the hypothesis space (that is, the 
specification of the model and all its probability distributions), and figuring 
out how to do the computations that implement inference in that space. The 
answers to our questions are probability distributions over the quantities of 
interest. We often find that the estimators of sampling theory emerge auto- 
matically as modes or means of these posterior distributions when we choose 
a simple hypothesis space and turn the handle of Bayesian inference. 
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(d) 



0.8 1 1.2 1.4 1.61.8 2 




0.4 0.6 0.8 1 1.2 1.4 1.61.8 2 



In sampling theory, the estimators above can be motivated as follows, x is 
an unbiased estimator of [i which, out of all the possible unbiased estimators 
of (i, has smallest variance (where this variance is computed by averaging over 
an ensemble of imaginary experiments in which the data samples are assumed 
to come from an unknown Gaussian distribution). The estimator (x,a N ) is the 
maximum likelihood estimator for (u,, a). The estimator a N is biased, however: 
the expectation of a N , given a, averaging over many imagined experiments, is 
not a. 

Exercise 24. l.^ 2 ' P- 323 l Give an intuitive explanation why the estimator a N is 
biased. 

This bias motivates the invention, in sampling theory, of a N ^, which can be 
shown to be an unbiased estimator. Or to be precise, it is a%_^ that is an 
unbiased estimator of a 2 . 

We now look at some Bayesian inferences for this problem, assuming non- 
informative priors for u, and a. The emphasis is thus not on the priors, but 
rather on (a) the likelihood function, and (b) the concept of marginalization. 
The joint posterior probability of /x and a is proportional to the likelihood 
function illustrated by a contour plot in figure 24.1a. The log likelihood is: 

lnP({x n }% =1 \u.,a) = -Nln(V2^a)-Y,^n-^) 2 /(2a 2 ), (24.5) 



Figure 24.1. The likelihood 
function for the parameters of a 
Gaussian distribution, repeated 
from figure 21.5. 

(al, a2) Surface plot and contour 
plot of the log likelihood as a 
function of fi and a. The data set 
of N = 5 points had mean x = 1.0 
and S 2 = - x) 2 = 1.0. 
Notice that the maximum is skew 
in a. The two estimators of 
standard deviation have values 
a N = 0.45 and cr N _ L = 0.50. 

(c) The posterior probability of a 
for various fixed values of fi 
(shown as a density over lna"). 

(d) The posterior probability of a, 
P(a | D), assuming a flat prior on 
H, obtained by projecting the 
probability mass in (a) onto the a 
axis. The maximum of P(a \ D) is 
at (Tiv-i. By contrast, the 
maximum of P(a \D, fi = x) is at 
a N . (Both probabilities are shows 
as densities over In a. ) 



= -Nln{V2na)-[N{n-xy + S}/{2a 2 ) 1 (24.6) 

where S = ^2 n {x n — x) 2 . Given the Gaussian model, the likelihood can be 
expressed in terms of the two functions of the data x and S, so these two 
quantities are known as 'sufficient statistics'. The posterior probability of /i 
and a is, using the improper priors: 

P({x n }^ =1 \^<y)P(^a) 



P{n,o\{x n }% =l ) = 



exp i 



N(p-x) 2 +S 
2a 2 



^({*X=i) 



(24.7) 



(24.; 
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This function describes the answer to the question, 'given the data, and the 
noninformative priors, what might fi and a be?' It may be of interest to find 
the parameter values that maximize the posterior probability, though it should 
be emphasized that posterior probability maxima have no fundamental status 
in Bayesian inference, since their location depends on the choice of basis. Here 
we choose the basis (/u, In a), in which our prior is flat, so that the posterior 
probability maximum coincides with the maximum of the likelihood. As we 
saw in exercise 22.4 (p. 302), the maximum likelihood solution for [i and lncr 

is {p,(t}ml = {x,a N = \/S/n} . 

There is more to the posterior distribution than just its mode. As can 
be seen in figure 24.1a, the likelihood has a skew peak. As we increase cr, 
the width of the conditional distribution of fi increases (figure 22.1b). And 
if we fix n to a sequence of values moving away from the sample mean x, we 
obtain a sequence of conditional distributions over a whose maxima move to 
increasing values of cr (figure 24.1c). 

The posterior probability of fi given a is 



" \ \ X nS n =l I ° I 



oc exp{-N{fi- x) 2 /{2a 2 )) (24.10) 
= Normal^; x,a 2 /N). (24.11) 

We note the familiar o j\f~N scaling of the error bars on fi. 

Let us now ask the question 'given the data, and the noninformative priors, 
what might cr be?' This question differs from the first one we asked in that we 
are now not interested in fi. This parameter must therefore be marginalized 
over. The posterior probability of a is: 



The data-dependent term P({x n }^ =1 \ cr) appeared earlier as the normalizing 
constant in equation (24.9); one name for this quantity is the 'evidence', or 
marginal likelihood, for cr. We obtain the evidence for cr by integrating out 
fx; a noninformative prior P(/z) = constant is assumed; we call this constant 
1/0"^, so that we can think of the prior as a top-hat prior of width cr^. The 
Gaussian integral, P({x n }% =1 1 cr) = / P({x n }% =1 \ fi,a)P(ii) d/i, yields: 

lnP({x n }% =1 \a) = -Nln(V2^a) --^ + ln ^ / ^ . (24.13) 

The first two terms are the best fit log likelihood (i.e., the log likelihood with 
/i = x). The last term is the log of the Occam factor which penalizes smaller 
values of cr. (We will discuss Occam factors more in Chapter 28.) When we 
differentiate the log evidence with respect to In cr, to find the most probable 
cr, the additional volume factor (cr / 'y/~N) shifts the maximum from cr N to 

a N _, = y/S/(N-l). (24.14) 

Intuitively, the denominator (N—l) counts the number of noise measurements 
contained in the quantity S = z\^ n {x n — x) 2 . The sum contains N residuals 
squared, but there are only (N—l) effective noise measurements because the 
determination of one parameter /i from the data causes one dimension of noise 
to be gobbled up in unavoidable overfitting. In the terminology of classical 
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statistics, the Bayesian's best guess for a sets x 2 (the measure of deviance 
defined by \ 2 = J2 n ( x n ~ f 1 ) 2 /^ 2 ) equal to the number of degrees of freedom, 
N — 1. 

Figure 24. Id shows the posterior probability of cr, which is proportional 
to the marginal likelihood. This may be contrasted with the posterior prob- 
ability of a with fj, fixed to its most probable value, 5 = 1, which is shown in 
figure 24.1c and d. 

The final inference we might wish to make is 'given the data, what is /!?' 

> Exercise 24.2. ^ Marginalize over a and obtain the posterior marginal distri- 
bution of /I, which is a Student-t distribution: 

P(ji\D) oc l/(N{n-x) 2 + S) N/2 . (24.15) 

Further reading 

A bible of exact marginalization is Bretthorst's (1988) book on Bayesian spec- 
trum analysis and parameter estimation. 

► 24.2 Exercises 

t> Exercise 24.3. [3] [This exercise requires macho integration capabilities.] Give 
a Bayesian solution to exercise 22.15 (p. 309), where seven scientists of 
varying capabilities have measured fi with personal noise levels a n , 
and we are interested in inferring fi. Let the prior on each a n be a 
broad prior, for example a gamma distribution with parameters (s,c) = 
(10,0.1). Find the posterior distribution of /i. Plot it, and explore its 
properties for a variety of data sets such as the one given, and the data 
set {x n } = {13.01,7.39}. 

[Hint: first find the posterior distribution of a n given fi and x n , 
P{&n \ x n ,/J,). Note that the normalizing constant for this inference is 
P(x n | n). Marginalize over a n to find this normalizing constant, then 
use Bayes' theorem a second time to find P(fi \ {x n }).] 

► 24.3 Solutions 

Solution to exercise 24.1 (p. 321). 1. The data points are distributed with mean 
squared deviation a 2 about the true mean. 2. The sample mean is unlikely 
to exactly equal the true mean. 3. The sample mean is the value of \x that 
minimizes the sum squared deviation of the data points from \x. Any other 
value of fi (in particular, the true value of /i) will have a larger value of the 
sum-squared deviation that fi = x. 

So the expected mean squared deviation from the sample mean is neces- 
sarily smaller than the mean squared deviation a 2 about the true mean. 
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25 



Exact Margin&liz&tion in Trellises 



In this chapter we will discuss a few exact methods that are used in proba- 
bilistic modelling. As an example we will discuss the task of decoding a linear 
error-correcting code. We will see that inferences can be conducted most effi- 
ciently by message-passing algorithms, which take advantage of the graphical 
structure of the problem to avoid unnecessary duplication of computations 
(see Chapter 16). 

► 25.1 Decoding problems 

A codeword t is selected from a linear (N, K) code C, and it is transmitted 
over a noisy channel; the received signal is y. In this chapter we will assume 
that the channel is a memoryless channel such as a Gaussian channel. Given 
an assumed channel model P(y | t), there are two decoding problems. 

The codeword decoding problem is the task of inferring which codeword 
t was transmitted given the received signal. 

The bitwise decoding problem is the task of inferring for each transmit- 
ted bit t n how likely it is that that bit was a one rather than a zero. 

As a concrete example, take the (7,4) Hamming code. In Chapter 1, we 
discussed the codeword decoding problem for that code, assuming a binary 
symmetric channel. We didn't discuss the bitwise decoding problem and we 
didn't discuss how to handle more general channel models such as a Gaussian 
channel. 

Solving the codeword decoding problem 

By Bayes' theorem, the posterior probability of the codeword t is 



Likelihood function. The first factor in the numerator, P(y 1 1), is the likeli- 
hood of the codeword, which, for any memoryless channel, is a separable 



^(t|y) = 



P(y\t)P(t) 
P(y) 



(25.1) 



function, 



N 



P(y\t) = \\P{yn\t n ). 



(25.2) 



n=l 



For example, if the channel is a Gaussian channel with transmissions ±x 
and additive noise of standard deviation cr, then the probability density 
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25.1: Decoding problems 



of the received signal y n in the two cases t n = 0, 1 is 

pfe " |t " =1) " vlp exp (-^) <25 - 3) 

Pta.lft.-0) - ^exp(-<^). (25.4, 

From the point of view of decoding, all that matters is the likelihood 
ratio, which for the case of the Gaussian channel is 



P(y n \t n = l) 
P(y n \t n = o) 




(25.5) 



Exercise 25.1.'- ' Show that from the point of view of decoding, a Gaussian 
channel is equivalent to a time- varying binary symmetric channel with 
a known noise level /„ which depends on n. 

Prior. The second factor in the numerator is the prior probability of the 
codeword, P(t), which is usually assumed to be uniform over all valid 
codewords. 

The denominator in (25.1) is the normalizing constant 

^(y) = E p (yi t ) p ( t )- ( 25 - 6 ) 



The complete solution to the codeword decoding problem is a list of all 
codewords and their probabilities as given by equation (25.1). Since the num- 
ber of codewords in a linear code, 2 K , is often very large, and since we are not 
interested in knowing the detailed probabilities of all the codewords, we often 
restrict attention to a simplified version of the codeword decoding problem. 

The MAP codeword decoding problem is the task of identifying the 
most probable codeword t given the received signal. 

If the prior probability over codewords is uniform then this task is iden- 
tical to the problem of maximum likelihood decoding, that is, identifying 
the codeword that maximizes P(y | t). 

Example: In Chapter 1, for the (7,4) Hamming code and a binary symmetric 
channel we discussed a method for deducing the most probable codeword from 
the syndrome of the received signal, thus solving the MAP codeword decoding 
problem for that case. We would like a more general solution. 

The MAP codeword decoding problem can be solved in exponential time 
(of order 2 K ) by searching through all codewords for the one that maximizes 
P(y|t)P(t). But we are interested in methods that are more efficient than 
this. In section 25.3, we will discuss an exact method known as the min-sum 
algorithm which may be able to solve the codeword decoding problem more 
efficiently; how much more efficiently depends on the properties of the code. 

It is worth emphasizing that MAP codeword decoding for a general lin- 
ear code is known to be NP-complete (which means in layman's terms that 
MAP codeword decoding has a complexity that scales exponentially with the 
blocklength, unless there is a revolution in computer science). So restrict- 
ing attention to the MAP decoding problem hasn't necessarily made the task 
much less challenging; it simply makes the answer briefer to report. 
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Solving the bitwise decoding problem 



Formally, the exact solution of the bitwise decoding problem is obtained from 
equation (25.1) by marginalizing over the other bits. 



P(t n \y)= E P(t\y). 



(25.7) 



We can also write this marginal with the aid of a truth function 1[S] that is 
one if the proposition S is true and zero otherwise. 



P(t n = l\y) = £)P(t|y)l[t n = l] 
t 

P(t n =o\y) = ^P(t|y)l[t n = 0]. 



(25.8) 
(25.9) 



Computing these marginal probabilities by an explicit sum over all codewords 
t takes exponential time. But, for certain codes, the bitwise decoding problem 
can be solved much more efficiently using the forward-backward algorithm. 
We will describe this algorithm, which is an example of the sum-product 
algorithm, in a moment. Both the min-sum algorithm and the sum-product 
algorithm have widespread importance, and have been invented many times 
in many fields. 

► 25.2 Codes and trellises 

In Chapters 1 and 11, we represented linear (N,K) codes in terms of their 
generator matrices and their parity-check matrices. In the case of a systematic 
block code, the first K transmitted bits in each block of size N are the source 
bits, and the remaining M = N — K bits are the parity-check bits. This means 
that the generator matrix of the code can be written 



G T 



Ik 
P 



and the parity-check matrix can be written 



H 



M 



where P is an M x K matrix. 

In this section we will study another representation of a linear code called a 
trellis. The codes that these trellises represent will not in general be systematic 
codes, but they can be mapped onto systematic codes if desired by a reordering 
of the bits in a block. 



(a) 




Repetition code R3 



(b) 




Simple parity code P3 



(25.10) (c) 



(25.11) 




(7, 4) Hamming code 

Figure 25.1. Examples of trellises. 
Each edge in a trellis is labelled 
by a zero (shown by a square) or 
a one (shown by a cross). 



Definition of a trellis 

Our definition will be quite narrow. For a more comprehensive view of trellises, 
the reader should consult Kschischang and Sorokine (1995). 

A trellis is a graph consisting of nodes (also known as states or vertices) and 
edges. The nodes are grouped into vertical slices called times, and the 
times are ordered such that each edge connects a node in one time to 
a node in a neighbouring time. Every edge is labelled with a symbol. 
The leftmost and rightmost states contain only one node. Apart from 
these two extreme nodes, all nodes in the trellis have at least one edge 
connecting leftwards and at least one connecting rightwards. 
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25.3: Solving the decoding problems on a trellis 

A trellis with N+l times defines a code of blocklength N as follows: a 
codeword is obtained by taking a path that crosses the trellis from left to right 
and reading out the symbols on the edges that are traversed. Each valid path 
through the trellis defines a codeword. We will number the leftmost time 'time 
0' and the rightmost 'time N\ We will number the leftmost state 'state 0' 
and the rightmost 'state P, where / is the total number of states (vertices) in 
the trellis. The nth bit of the codeword is emitted as we move from time n— 1 
to time n. 

The width of the trellis at a given time is the number of nodes in that 
time. The maximal width of a trellis is what it sounds like. 

A trellis is called a linear trellis if the code it defines is a linear code. We will 
solely be concerned with linear trellises from now on, as nonlinear trellises are 
much more complex beasts. For brevity, we will only discuss binary trellises, 
that is, trellises whose edges are labelled with zeroes and ones. It is not hard 
to generalize the methods that follow to q-aiy trellises. 

Figures 25.1(a-c) show the trellises corresponding to the repetition code 
7? 3 which has (N, K) = (3,1); the parity code P 3 with (N, K) = (3,2); and 
the (7, 4) Hamming code. 

> Exercise 25.2.^ Confirm that the sixteen codewords listed in table 1.14 are 
generated by the trellis shown in figure 25.1c. 

Observations about linear trellises 

For any linear code the minimal trellis is the one that has the smallest number 
of nodes. In a minimal trellis, each node has at most two edges entering it and 
at most two edges leaving it. All nodes in a time have the same left degree as 
each other and they have the same right degree as each other. The width is 
always a power of two. 

A minimal trellis for a linear (TV, K) code cannot have a width greater than 
2 K since every node has at least one valid codeword through it, and there are 
only 2 K codewords. Furthermore, if we define M = N — K, the minimal 
trellis's width is everywhere less than 2 M . This will be proved in section 25.4. 

Notice that for the linear trellises in figure 25.1, all of which are minimal 
trellises, K is the number of times a binary branch point is encountered as the 
trellis is traversed from left to right or from right to left. 

We will discuss the construction of trellises more in section 25.4. But we 
now know enough to discuss the decoding problem. 

► 25.3 Solving the decoding problems on a trellis 

We can view the trellis of a linear code as giving a causal description of the 
probabilistic process that gives rise to a codeword, with time flowing from left 
to right. Each time a divergence is encountered, a random source (the source 
of information bits for communication) determines which way we go. 

At the receiving end, we receive a noisy version of the sequence of edge- 
labels, and wish to infer which path was taken, or to be precise, (a) we want 
to identify the most probable path in order to solve the codeword decoding 
problem; and (b) we want to find the probability that the transmitted symbol 
at time n was a zero or a one, to solve the bitwise decoding problem. 

Example 25.3. Consider the case of a single transmission from the Hamming 
(7,4) trellis shown in figure 25.1c. 
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Likelihood Posterior probability 



0000000 
0001011 
0010111 
0011100 
0100110 
0101101 
0110001 
0111010 
1000101 
1001110 
1010010 
1011001 
1100011 
1101000 
1110100 

1111111 



0.0275562 
0.0001458 
0.0013122 
0.0030618 
0.0002268 
0.0000972 
0.0708588 
0.0020412 
0.0001458 
0.0000042 
0.0030618 
0.0013122 
0.0000972 
0.0002268 
0.0020412 
0.0000108 



0.25 

0.0013 

0.012 

0.027 

0.0020 

0.0009 

0.63 

0.018 

0.0013 

0.0000 

0.027 

0.012 

0.0009 

0.0020 

0.018 

0.0001 



Figure 25.2. Posterior probabilities 
over the sixteen codewords when 
the received vector y has 
normalized likelihoods 
(0.1,0.4,0.9,0.1,0.1,0.1,0.3). 



Let the normalized likelihoods be: (0.1,0.4,0.9,0.1,0.1,0.1,0.3). That is, 
the ratios of the likelihoods are 

P(yi|£i = l) _ 0.1 P{V2\x 2 = l) _ 0.4 

P( yi |^ = 0)-0.9' P(y 2 |x 2 = 0)-0.6' ^ [2bA2) 

How should this received signal be decoded? 



1. If we threshold the likelihoods at 0.5 to turn the signal into a bi- 
nary received vector, we have r = (0,0,1,0,0,0,0), which decodes, 
using the decoder for the binary symmetric channel (Chapter 1), into 
t = (0,0,0,0,0,0,0). 

This is not the optimal decoding procedure. Optimal inferences are 
always obtained by using Bayes' theorem. 

2. We can find the posterior probability over codewords by explicit enu- 
meration of all sixteen codewords. This posterior distribution is shown 
in figure 25.2. Of course, we aren't really interested in such brute-force 
solutions, and the aim of this chapter is to understand algorithms for 
getting the same information out in less than 2 K computer time. 

Examining the posterior probabilities, we notice that the most probable 
codeword is actually the string t = 0110001. This is more than twice as 
probable as the answer found by thresholding, 0000000. 

Using the posterior probabilities shown in figure 25.2, we can also com- 
pute the posterior marginal distributions of each of the bits. The result 
is shown in figure 25.3. Notice that bits 1, 4, 5 and 6 are all quite con- 
fidently inferred to be zero. The strengths of the posterior probabilities 
for bits 2, 3, and 7 are not so great. □ 



In the above example, the MAP codeword is in agreement with the bitwise 
decoding that is obtained by selecting the most probable state for each bit 
using the posterior marginal distributions. But this is not always the case, as 
the following exercise shows. 
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n Likelihood 

P(Vn\t n = l) P{yn\t n = 0) P{t n - 



Posterior marginals 
:l|y) P(t n = 0|y) 



1 


0.1 


0.9 


0.061 


□ 


2 


0.4 


0.6 


0.674 


1 


3 


0.9 


0.1 


0.746 


1 


4 


0.1 


0.9 


0.061 


□ 


5 


0.1 


0.9 


0.061 


□ 


6 


0.1 


0.9 


0.061 


□ 


7 


0.3 


0.7 


0.659 


1 



0.939 
0.326 
0.254 
0.939 
0.939 
0.939 
0.341 



Figure 25.3. Marginal posterior 
probabilities for the 7 bits under 
the posterior distribution of 
figure 25.2. 



Exercise 25.4.^ ' p ' ' Find the most probable codeword in the case where 
the normalized likelihood is (0.2,0.2,0.9,0.2,0.2,0.2,0.2). Also find or 
estimate the marginal posterior probability for each of the seven bits, 
and give the bit-by-bit decoding. 

[Hint: concentrate on the few codewords that have the largest probabil- 
ity] 

We now discuss how to use message passing on a code's trellis to solve the 
decoding problems. 

The min-sum algorithm 

The MAP codeword decoding problem can be solved using the min-sum al- 
gorithm that was introduced in section 16.3. Each codeword of the code 
corresponds to a path across the trellis. Just as the cost of a journey is the 
sum of the costs of its constituent steps, the log likelihood of a codeword is 
the sum of the bitwise log likelihoods. By convention, we flip the sign of the 
log likelihood (which we would like to maximize) and talk in terms of a cost, 
which we would like to minimize. 

We associate with each edge a cost — logP(y„ \ t n ), where t n is the trans- 
mitted bit associated with that edge, and y n is the received symbol. The 
min-sum algorithm presented in section 16.3 can then identify the most prob- 
able codeword in a number of computer operations equal to the number of 
edges in the trellis. This algorithm is also known as the Viterbi algorithm 
(Viterbi, 1967). 

The sum-product algorithm 

To solve the bitwise decoding problem, we can make a small modification to 
the min-sum algorithm, so that the messages passed through the trellis define 
'the probability of the data up to the current point' instead of 'the cost of the 
best route to this point'. We replace the costs on the edges, —log P{y n | t n ), by 
the likelihoods themselves, P(y n \ t n ). We replace the min and sum operations 
of the min-sum algorithm by a sum and product respectively. 

Let i run over nodes/states, i = be the label for the start state, V(i) 
denote the set of states that are parents of state i, and Wij be the likelihood 
associated with the edge from node j to node i. We define the forward-pass 
messages a, by 



"0 = 1 
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cti = w ij a j- (25.13) 

jev(i) 

These messages can be computed sequentially from left to right. 

> Exercise 25.5. ^ Show that for a node i whose time-coordinate is n, on is 

proportional to the joint probability that the codeword's path passed 
through node i and that the first n received symbols were yi, . . . , y n . 

The message aj computed at the end node of the trellis is proportional to the 
marginal probability of the data. 

> Exercise 25.6. ^ What is the constant of proportionality? [Answer: 2 K ] 

We define a second set of backward-pass messages in a similar manner. 
Let node / be the end node. 

Pi = 1 

Pi = Z) Wi i&- (25J4) 
r.jev(i) 

These messages can be computed sequentially in a backward pass from right 
to left. 

> Exercise 25.7.^ Show that for a node i whose time-coordinate is n, Pi is 

proportional to the conditional probability, given that the codeword's 
path passed through node i, that the subsequent received symbols were 

Vn+l ■■■UN- 

Finally, to find the probability that the nth bit was a 1 or 0, we do two 
summations of products of the forward and backward messages. Let i run over 
nodes at time n and j run over nodes at time n — 1 , and let t{j be the value 
of t n associated with the trellis edge from node j to node i. For each value of 
t = 0/1, we compute 

r n= ajWijPi. (25.15) 

Then the posterior probability that t n was t = 0/1 is 

P{tn = t\y) = \4\ (25.16) 

where the normalizing constant Z = rffl + r« should be identical to the final 
forward message ai that was computed earlier. 

Exercise 25.8. Confirm that the above sum-product algorithm does com- 
pute P(t n = t\y). 

Other names for the sum-product algorithm presented here are 'the forward- 
backward algorithm', 'the BCJR algorithm', and 'belief propagation'. 

> Exercise 25.9. ^' p ' 333 l A codeword of the simple parity code P3 is transmitted, Table 25.4. Bitwise likelihoods for 

and the received signal y has associated likelihoods shown in table 25.4. a codeword of P3. 

Use the min-sum algorithm and the sum-product algorithm in the trellis 

(figure 25.1) to solve the MAP codeword decoding problem and the 

bitwise decoding problem. Confirm your answers by enumeration of 

all codewords (000, Oil, 110, 101). [Hint: use logs to base 2 and do 

the min-sum computations by hand. When working the sum-product 

algorithm by hand, you may find it helpful to use three colours of pen, 

one for the as, one for the ws, and one for the /?s.] 



n 


P(Vn 


\tn) 




t n = 


*n = l 


1 


V4 


1/2 


2 


V2 


V4 


3 


Vs 


1/2 
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► 25.4 More on trellises 

We now discuss various ways of making the trellis of a code. You may safely 
jump over this section. 

The span of a codeword is the set of bits contained between the first bit in 
the codeword that is non-zero, and the last bit that is non-zero, inclusive. We 
can indicate the span of a codeword by a binary vector as shown in table 25.5. 



Codeword 0000000 0001011 0100110 1100011 0101101 
Span 0000000 0001111 0111110 1111111 0111111 



Table 25.5. Some codewords and 
their spans. 



A generator matrix is in trellis-oriented form if the spans of the rows of the 
generator matrix all start in different columns and the spans all end in different 
columns. 

How to make a trellis from a generator matrix 

First, put the generator matrix into trellis- oriented form by row-manipulations 
similar to Gaussian elimination. For example, our (7,4) Hamming code can 
be generated by 

" 1 1 1 " 
10 110 
10 111 
1 1 1 



G 



(25.17) 



but this matrix is not in trellis-oriented form - for example, rows 1, 3 and 4 
all have spans that end in the same column. By subtracting lower rows from 
upper rows, we can obtain an equivalent generator matrix (that is, one that 
generates the same set of codewords) as follows: 



110 10 
10 110 
1110 
1 1 1 



(25.18) 



Now, each row of the generator matrix can be thought of as defining an 
(N, 1) subcode of the (N,K) code, that is, in this case, a code with two 
codewords of length N = 7. For the first row, the code consists of the two 
codewords 1101000 and 0000000. The subcode defined by the second row 
consists of 0100110 and 0000000. It is easy to construct the minimal trellises 
of these subcodes; they are shown in the left column of figure 25.6. 

We build the trellis incrementally as shown in figure 25.6. We start with 
the trellis corresponding to the subcode given by the first row of the generator 
matrix. Then we add in one subcode at a time. The vertices within the span 
of the new subcode are all duplicated. The edge symbols in the original trellis 
are left unchanged and the edge symbols in the second part of the trellis are 
flipped wherever the new subcode has a 1 and otherwise left alone. 

Another (7, 4) Hamming code can be generated by 



G 



1 1 1 

11110 

10 110 

1 1 1 1 



(25.19) 
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Figure 25.6. Trellises for four- 
subcodes of the (7, 4) Hamming 
code (left column), and the 
sequence of trellises that arc made 
when constructing the trellis for 
the (7, 4) Hamming code (right 
column) . 

Each edge in a trellis is labelled 
by a zero (shown by a square) or 
a one (shown by a cross). 



The (7, 4) Hamming code generated by this matrix differs by a permutation 
of its bits from the code generated by the systematic matrix used in Chapter 
1 and above. The parity-check matrix corresponding to this permutation is: 



H 



10 10 10 1 
110 11 
1 1 1 1 



(25.20) 



The trellis obtained from the permuted matrix G given in equation (25.19) is 
shown in figure 25.7a. Notice that the number of nodes in this trellis is smaller 
than the number of nodes in the previous trellis for the Hamming (7, 4) code 
in figure 25.1c. We thus observe that rearranging the order of the codeword 
bits can sometimes lead to smaller, simpler trellises. 

Trellises from parity-check matrices 

Another way of viewing the trellis is in terms of the syndrome. The syndrome 
of a vector r is defined to be Hr, where H is the parity-check matrix. A vector 
is only a codeword if its syndrome is zero. As we generate a codeword we can 
describe the current state by the partial syndrome, that is, the product of 
H with the codeword bits thus far generated. Each state in the trellis is a 
partial syndrome at one time coordinate. The starting and ending states are 
both constrained to be the zero syndrome. Each node in a state represents a 
different possible value for the partial syndrome. Since HisanMxiV matrix, 
where M = N — K , the syndrome is at most an M-bit vector. So we need at 
most 2 M nodes in each state. We can construct the trellis of a code from its 
parity-check matrix by walking from each end, generating two trees of possible 
syndrome sequences. The intersection of these two trees defines the trellis of 
the code. 

In the pictures we obtain from this construction, we can let the vertical 
coordinate represent the syndrome. Then any horizontal edge is necessarily 
associated with a zero bit (since only a non-zero bit changes the syndrome) 




Figure 25.7. Trellises for the 
permuted (7, 4) Hamming code 
generated from (a) the generator 
matrix by the method of 
figure 25.6; (b) the parity-check 
matrix by the method on page 
332. 

Each edge in a trellis is labelled 
by a zero (shown by a square) or 
a one (shown by a cross). 
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and any non- horizontal edge is associated with a one bit. (Thus in this rep- 
resentation we no longer need to label the edges in the trellis.) Figure 25.7b 
shows the trellis corresponding to the parity-check matrix of equation (25.20). 



► 25.5 Solutions 



Table 25.8. The posterior 

t Likelihood Posterior probability probability over codewords for 



0000000 


0.026 


0.3006 


1 1 


0001011 


0.00041 


0.0047 


1 


0010111 


0.0037 


0.0423 


□ 


0011100 


0.015 


0.1691 


□ 


0100110 


0.00041 


0.0047 


1 


0101101 


0.00010 


0.0012 


1 


0110001 


0.015 


0.1691 


□ 


0111010 


0.0037 


0.0423 


□ 


1000101 


0.00041 


0.0047 


1 


1001110 


0.00010 


0.0012 


1 


1010010 


0.015 


0.1691 


□ 


1011001 


0.0037 


0.0423 


□ 


1100011 


0.00010 


0.0012 


1 


1101000 


0.00041 


0.0047 


1 


1110100 


0.0037 


0.0423 


D 


1111111 


0.000058 


0.0007 


1 



Solution to exercise 25.4 (p. 329). The posterior probability over codewords is 
shown in table 25.8. The most probable codeword is 0000000. The marginal 
posterior probabilities of all seven bits are: 



n 




Likelihood 




Posterior marginals 




P(Un I t„ 


= 1) P(y n \t n = 0) 


P(t n 


= l|y) P(t n = 0\y) 


1 


0.2 


0.8 


0.266 I 


1 0.734 1 1 


2 


0.2 


0.8 


0.266 I 


=1 0.734 1 1 


3 


0.9 


0.1 


0.677 1 


1 0.323 l 1 


4 


0.2 


0.8 


0.266 1 


=1 0.734 1 1 


5 


0.2 


0.8 


0.266 I 


=1 0.734 1 1 


6 


0.2 


0.8 


0.266 r 


=1 0.734 l 1 


7 


0.2 


0.8 


0.266 r 


=1 0.734 l 1 



So the bitwise decoding is 0010000, which is not actually a codeword. 

Solution to exercise 25.9 (p. 330). The MAP codeword is 101, and its like- 
lihood is 1/8. The normalizing constant of the sum-product algorithm is 
Z = aj = 3 /l6. The intermediate a, are (from left to right) V2, 1 k, 5 /l6, 4 /l6; 
the intermediate are (from right to left), V2, Vs, 9 /32, 3 /l6. The bitwise 
decoding is: = 1 | y) = 3/4; P(tx = 1 1 y) = 1/4; = 1 | y) = 5/6. The 
codewords' probabilities are Vl2, 2 /l2, Vl2, 8 /l2 for 000, Oil, 110, 101. 
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26 



Exact Marginalization in Graphs 



We now take a more general view of the tasks of inference and marginalization. 
Before reading this chapter, you should read about message passing in Chapter 
16. 

► 26.1 The general problem 

Assume that a function P* of a set of N variables x = {x n }^ =1 is defined as 
a product of M factors as follows: 

M 

P*(x) = n / m (x m ). (26.1) 

m=l 

Each of the factors / m (x m ) is a function of a subset x m of the variables that 
make up x. If P* is a positive function then we may be interested in a second 
normalized function, 

M 

P(x) = |P*(x) = i n / m (x m ), (26.2) 

m=l 

where the normalizing constant Z is defined by 

M 

Z = Y. II /™( x ™)- ( 26 - 3 ) 

x m=l 

As an example of the notation we've just introduced, here's a function of 
three binary variables x\, X2, x 3 defined by the five factors: 

0.1 xi=0 
0.9 xi = 1 
0.1 a; 2 = 
0.9 x 2 = l 
0.9 x 3 =0 
0.1 x 3 = l 

1 (0:1,12) = (0,0) or (1,1) ( 26 - 4 ) 

(a:i,x 2 ) = (l,0) or (0,1) 

1 (0:2,13) = (0,0) or (1,1) 
(x 2 ,x 3 ) = (l,0) or (0,1) 



The five subsets of {x\, x 2 , x%} denoted by x m in the general function (26.1) 
are here x x = {a^}, x 2 = {x 2 }, x 3 = {x 3 }, x 4 = {x x ,x 2 }, and x 5 = {x 2 ,x 3 }. 



fi(xi) 


- { 


h{x2) 


- { 


h(x 3 ) 


- { 


fi{x\,x 2 ) 


- { 


k{x 2 ,x 3 ) 


- { 


P*(x) 
P(x) 


= h 

1 

— z 
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26.1: The general problem 



Xi x 2 



The function -P(x), by the way, may be recognized as the posterior prob- 
ability distribution of the three transmitted bits in a repetition code (section 
1.2) when the received signal is r = (1, 1, 0) and the channel is a binary sym- 
metric channel with flip probability 0.1. The factors }\ and f$ respectively 
enforce the constraints that X\ and x 2 must be identical and that x 2 and x$ 
must be identical. The factors f\, f 2 , fs are the likelihood functions con- 
tributed by each component of r. 

A function of the factored form (26.1) can be depicted by a factor graph, in 
which the variables are depicted by circular nodes and the factors are depicted 
by square nodes. An edge is put between variable node n and factor node m Figure 26.1. The factor graph 
if the function / m (x m ) has any dependence on variable x n . The factor graph associated with the function 
for the example function (26.4) is shown in figure 26.1. P*W (26.4). 




□ W 

fi h /3 fi h 



The normalization problem 

The first task to be solved is to compute the normalizing constant Z. 



The marginalization problems 

The second task to be solved is to compute the marginal function of any 
variable x n , defined by 

z„(x n )= y, p *( x )- ( 26 - 5 ) 

{z n ,},n'/n 

For example, if / is a function of three variables then the marginal for 
n = 1 is defined by 

Z x {x x ) = Y f{xi,x 2 ,x 3 ). (26.6) 

2:2, £3 

This type of summation, over 'all the x n > except for x„ is so important that it 
can be useful to have a special notation for it - the 'not-sum' or 'summary'. 

The third task to be solved is to compute the normalized marginal of any 
variable x n , defined by 

P n (x n ) = ]T P(x). (26.7) 

{&„;}, n'#n 

[We include the suffix 'n' in P n (x n ), departing from our normal practice in the 
rest of the book, where we would omit it.] 

> Exercise 26.1. ^ Show that the normalized marginal is related to the marginal 

Z n {x n ) by 

P n {x n ) = (26.8) 

We might also be interested in marginals over a subset of the variables, 
such as 

Zi 2 {x\,x 2 ) = YP*(xi,x 2 ,X3). (26.9) 

All these tasks are intractable in general. Even if every factor is a function 
of only three variables, the cost of computing exact solutions for Z and for 
the marginals is believed in general to grow exponentially with the number of 
variables N. 

For certain functions P* , however, the marginals can be computed effi- 
ciently by exploiting the factorization of P*. The idea of how this efficiency 
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26 — Exact Marginalization in Graphs 



arises is well illustrated by the message-passing examples of Chapter 16. The 
sum-product algorithm that we now review is a generalization of message- 
passing rule-set B (p. 242). As was the case there, the sum-product algorithm 
is only valid if the graph is tree-like. 

► 26.2 The sum-product algorithm 

Notation 

We identify the set of variables that the mth factor depends on, x m , by the set 
of their indices J\f(m). For our example function (26.4), the sets are jV(1) = 
{1} (since /j is a function of x x alone), 7V(2) = {2}, 7V(3) = {3}, 7V(4) = 
{1,2}, and A/"(5) = {2,3}. Similarly we define the set of factors in which 
variable n participates, by A4(n). We denote a set Af(m) with variable n 
excluded by Af(m)\n. We introduce the shorthand x m \n or x m \„ to denote 
the set of variables in x m with x n excluded, i.e., 



x m \n = {x n > : n G Af(m)\n}. 



(26.10) 



The sum-product algorithm will involve messages of two types passing 
along the edges in the factor graph: messages q n ^ m from variable nodes to 
factor nodes, and messages r m ^ n from factor nodes to variable nodes. A 
message (of either type, q or r) that is sent along an edge connecting factor 
f m to variable x n is always a function of the variable x n . 

Here are the two rules for the updating of the two sets of messages. 



From variable to factor: 



Qn^m(^n) — ^ j[ ^ m'— m^n) • 

m'eM(n)\m 



(26.11) 



From factor to variable: 



r m^n{ x n) — ^ ] J /m( x ra) Jj_ | qn'^m{ x n') 
x m \n \ n'eN{m)\n 



(26.12) 



How these rules apply to leaves in the factor graph 

A node that has only one edge connecting it to another node is called a leaf 
node. 

Some factor nodes in the graph may be connected to only one vari- 
able node, in which case the set J\f(m)\n of variables appearing in the fac- 
tor message update (26.12) is an empty set, and the product of functions 
T\n'eN(m)\n Qn'^m{ x n') is the empty product, whose value is 1. Such a fac- 
tor node therefore always broadcasts to its one neighbour x n the message 

^m^n(^-n) = fm{ x n)- 

Similarly, there may be variable nodes that are connected to only one 
factor node, so the set M(n)\m in (26.11) is empty. These nodes perpetually 
broadcast the message q n ^m(x„) = 1. 




I'm— *n(^n) fm{%n) 



Figure 26.2. A factor node that is 
a leaf node perpetually sends the 
message r m ^ n (x n ) = f m (x n ) to 
its one neighbour x n . 



Figure 26.3. A variable node that 
is a leaf node perpetually sends 
the message q n ^ m (x n ) = 1. 




Starting and finishing, method 1 

The algorithm can be initialized in two ways. If the graph is tree-like then 
it must have nodes that are leaves. These leaf nodes can broadcast their 
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26.2: The sum-product algorithm 



Xi x 2 




□ W 

h h fz fi fb 



messages to their respective neighbours from the start. 

For all leaf variable nodes n: q n ^m(x n ) = 1 (26.13) 
For all leaf factor nodes m: r m ^ n (x n ) = f m (x n ). (26.14) 

We can then adopt the procedure used in Chapter 16's message-passing rule- 
set B (p. 242): a message is created in accordance with the rules (26.11, 26.12) 
only if all the messages on which it depends are present. For example, in 
figure 26.4, the message from X\ to f\ will be sent only when the message 
from fi to X\ has been received; and the message from x 2 to f 2 , q 2 ^ 2 , can be 
sent only when the messages r^ 2 and r^ 2 have both been received. 

Messages will thus flow through the tree, one in each direction along every 
edge, and after a number of steps equal to the diameter of the graph, every Figure 26.4. Our model factor 
message will have been created. graph for the function P*(x) 

The answers we require can then be read out. The marginal function of 

(26.4). 

x n is obtained by multiplying all the incoming messages at that node. 

z n {x n ) = n 

(x n ). (26.15) 

m€.M{n) 

The normalizing constant Z can be obtained by summing any marginal 
function, Z = Yl Xn Zn{x n ), and the normalized marginals obtained from 

P n{Xn ) = ^M. (26.16) 

\> Exercise 26. 2 Apply the sum-product algorithm to the function defined in 
equation (26.4) and figure 26.1. Check that the normalized marginals 
are consistent with what you know about the repetition code R3. 

Exercise 26.3. ^ Prove that the sum-product algorithm correctly computes 
the marginal functions Z n (x n ) if the graph is tree-like. 

Exercise 26.4. ^ Describe how to use the messages computed by the sum- 
product algorithm to obtain more complicated marginal functions in a 
tree-like graph, for example Zi t2 {x\,x 2 ), for two variables x\ and x 2 that 
are connected to one common factor node. 



Starting and finishing, method 2 

Alternatively, the algorithm can be initialized by setting all the initial mes- 
sages from variables to 1: 

for all n, m: q n ^ m (x n ) = 1, (26.17) 

then proceeding with the factor message update rule (26.12), alternating with 
the variable message update rule (26.11). Compared with method 1, this lazy 
initialization method leads to a load of wasted computations, whose results 
are gradually flushed out by the correct answers computed by method 1. 

After a number of iterations equal to the diameter of the factor graph, 
the algorithm will converge to a set of messages satisfying the sum-product 
relationships (26.11, 26.12). 

Exercise 26.5.^ Apply this second version of the sum-product algorithm to 
the function defined in equation (26.4) and figure 26.1. 
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The reason for introducing this lazy method is that (unlike method 1) it can 
be applied to graphs that are not tree-like. When the sum-product algorithm 
is run on a graph with cycles, the algorithm does not necessarily converge, 
and certainly does not in general compute the correct marginal functions; but 
it is nevertheless an algorithm of great practical importance, especially in the 
decoding of sparse-graph codes. 



Sum-product algorithm with on-the-fly normalization 

If we are interested in only the normalized marginals, then another version 
of the sum-product algorithm may be useful. The factor-to-variable messages 
r-m^n are computed in just the same way (26.12), but the variable-to-factor 
messages are normalized thus: 

(x n ) (26.18) 

m'€M(n)\m 
where a nm is a scalar chosen such that 

Y,<ln^ m {Xn) = l- (26.19) 

Exercise 26.6.^ Apply this normalized version of the sum-product algorithm 
to the function defined in equation (26.4) and figure 26.1. 



A factorization view of the sum-product algorithm 

One way to view the sum-product algorithm is that it reexpresses the original 
factored function, the product of M factors P*(x) = rim=i /m( x m), as another 
factored function which is the product of M + N factors, 

M N 

m( x ra ) J\ ^n{x n )- (26.20) 
m=l n=l 

Each factor (f> m is associated with a factor node m, and each factor ip n (x n ) is 
associated with a variable node. Initially (/> m (x TO ) = /m( x m) and ipn(x n ) = 1. 

Each time a factor-to-variable message r m -> n (x n ) is sent, the factorization 
is updated thus: 

(x n ) (26.21) 
^ m (x m ) = =. /(Xm) (26.22) 

lln&/V(m) r m^n(X n ) 

And each message can be computed in terms of and ip using 



r m ^ n {x n ) = ^2 <^™( x ™) II ^n'{x n >) ] (26.23) 

x m \n \ n'eN{m) J 

which differs from the assignment (26.12) in that the product is over all n' € 
M(m). 

Exercise 26.7. ^ Confirm that the update rules (26.21-26.23) are equivalent 
to the sum-product rules (26.11-26.12). So tjj n {x n ) eventually becomes 
the marginal Z n (x n ). 



This factorization viewpoint applies whether or not the graph is tree-like. 
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26.3: The min-sum algorithm 
Computational tricks 

On-the-fly normalization is a good idea from a computational point of view 
because if P* is a product of many factors, its values are likely to be very large 
or very small. 

Another useful computational trick involves passing the logarithms of the 
messages q and r instead of q and r themselves; the computations of the 
products in the algorithm (26.11, 26.12) are then replaced by simpler additions. 
The summations in (26.12) of course become more difficult: to carry them out 
and return the logarithm, we need to compute softmax functions like 

I =ln(e h +e h + e h ). (26.24) 

But this computation can be done efficiently using look-up tables along with 
the observation that the value of the answer I is typically just a little larger 
than maxi l{. If we store in look-up tables values of the function 

ln(l + e' 5 ) (26.25) 

(for negative 8) then I can be computed exactly in a number of look-ups and 
additions scaling as the number of terms in the sum. If look-ups and sorting 
operations are cheaper than expO then this approach costs less than the 
direct evaluation (26.24). The number of operations can be further reduced 
by omitting negligible contributions from the smallest of the {U}. 

A third computational trick applicable to certain error-correcting codes is 
to pass not the messages but the Fourier transform of the messages. This 
again makes the computations of the factor-to- variable messages quicker. A 
simple example of this Fourier transform trick is given in Chapter 47 at equa- 
tion (47.9). 

► 26.3 The min-sum algorithm 

The sum-product algorithm solves the problem of finding the marginal func- 
tion of a given product P* (x) . This is analogous to solving the bitwise decod- 
ing problem of section 25.1. And just as there were other decoding problems 
(for example, the codeword decoding problem), we can define other tasks 
involving P* (x) that can be solved by modifications of the sum-product algo- 
rithm. For example, consider this task, analogous to the codeword decoding 
problem: 

The maximization problem. Find the setting of x that maximizes the 
product P*(x). 

This problem can be solved by replacing the two operations add and mul- 
tiply everywhere they appear in the sum-product algorithm by another pair 
of operations that satisfy the distributive law, namely max and multiply. If 
we replace summation (+, J^) by maximization, we notice that the quantity 
formerly known as the normalizing constant, 

Z = ^P*(x), (26.26) 

X 

becomes max x P*(x). 

Thus the sum-product algorithm can be turned into a max-product algo- 
rithm that computes max x P* (x) , and from which the solution of the max- 
imization problem can be deduced. Each 'marginal' Z n (x n ) then lists the 
maximum value that P*(x) can attain for each value of x n . 
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In practice, the max-product algorithm is most often carried out in the 
negative log likelihood domain, where max and product become min and sum. 
The min-sum algorithm is also known as the Viterbi algorithm. 

► 26.4 The junction tree algorithm 

What should one do when the factor graph one is interested in is not a tree? 

There are several options, and they divide into exact methods and approx- 
imate methods. The most widely used exact method for handling marginaliza- 
tion on graphs with cycles is called the junction tree algorithm. This algorithm 
works by agglomerating variables together until the agglomerated graph has 
no cycles. You can probably figure out the details for yourself; the complexity 
of the marginalization grows exponentially with the number of agglomerated 
variables. Read more about the junction tree algorithm in (Lauritzen, 1996; 
Jordan, 1998). 

There are many approximate methods, and we'll visit some of them over 
the next few chapters - Monte Carlo methods and variational methods, to 
name a couple. However, the most amusing way of handling factor graphs 
to which the sum-product algorithm may not be applied is, as we already 
mentioned, to apply the sum-product algorithm! We simply compute the 
messages for each node in the graph, as if the graph were a tree, iterate, and 
cross our fingers. This so-called 'loopy' message passing has great importance 
in the decoding of error-correcting codes, and we'll come back to it in section 
33.8 and Part VI. 



Further reading 

For further reading about factor graphs and the sum-product algorithm, see 
Kschischang et al. (2001), Yedidia et al. (2000), Yedidia et al. (2001a), Yedidia 
et al. (2002), Wainwright et al. (2003), and Forney (2001). 

See also Pearl (1988). A good reference for the fundamental theory of 
graphical models is Lauritzen (1996). A readable introduction to Bayesian 
networks is given by Jensen (1996). 

Interesting message-passing algorithms that have different capabilities from 
the sum-product algorithm include expectation propagation (Minka, 2001) 
and survey propagation (Braunstein et al, 2003). See also section 33.8. 

► 26.5 Exercises 

l> Exercise 26.8.^-' Express the joint probability distribution from the burglar 
alarm and earthquake problem (example 21.1 (p. 293)) as a factor graph, 
and find the marginal probabilities of all the variables as each piece of 
information comes to Fred's attention, using the sum-product algorithm 
with on-the-fly normalization. 
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27 



Laplace's Method 



The idea behind the Laplace approximation is simple. We assume that an 
unnormalized probability density P*(x), whose normalizing constant 



Z P 



= J P*{x) 



dx 



(27.1) 



is of interest, has a peak at a point xq. We Taylor-expand the logarithm of 
P*(x) around this peak: 



where 



]nP*(x)~kiP*(x )--(x-x ) 2 



c=-^lnP*(x) 



We then approximate P*(x) by an unnormalized Gaussian, 
Q*(x) = P*(x )exp [-^( x -xo) 2 ] , 



(27.2) 
(27.3) 

(27.4) 





lnP*(x) 
& \nQ*(x) 



and we approximate the normalizing constant Zp by the normalizing constant 
of this Gaussian, 

Z Q = P*(x )J — . (27.5) P*{x) 

V c &Q*{x) 
We can generalize this integral to approximate Zp for a density P*(x) over 
a K-dimensional space x. If the matrix of second derivatives of — lnP*(x) at 
the maximum x is A, defined by: 



Aij — 



2 



dxidxj 



lnP*(x) 



(27.6) 



x=x 



so that the expansion (27.2) is generalized to 



P*(x) 



lnP*(x) 





In P*(x) ~ In P*(x ) - - (x - x ) T A(x - x ) 
then the normalizing constant can be approximated by: 



Zp^z q = p^)^ = = p^^ k . 



\/ det ^ 



(27.7) 



(27.8) 



Predictions can be made using the approximation Q. Physicists also call this 
widely-used approximation the saddle-point approximation. 
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The fact that the normalizing constant of a Gaussian is given by 



/ 



d K x exp 



ix T Ax 

2 



(2T.9) 

det A v ' 



can be proved by making an orthogonal transformation into the basis u in which 
A is transformed into a diagonal matrix. The integral then separates into a 
product of one-dimensional integrals, each of the form 



/ 



ditj exp 



1 ^ 2 

-2 A ^ 



(27.10) 



The product of the eigenvalues Aj is the determinant of A. 



The Laplace approximation is basis-dependent: if x is transformed to a 
nonlinear function u(x) and the density is transformed to P(u) = P(x) \dx/du\ 
then in general the approximate normalizing constants Zq will be different. 
This can be viewed as a defect - since the true value Zp is basis-independent 
- or an opportunity - because we can hunt for a choice of basis in which the 
Laplace approximation is most accurate. 

► 27.1 Exercises 

f^ft| Exercise 27.1.^ (See also exercise 22.8 (p. 307).) A photon counter is pointed 
at a remote star for one minute, in order to infer the rate of photons 
arriving at the counter per minute, A. Assuming the number of photons 
collected r has a Poisson distribution with mean A, 

P(r|A)=exp(-A)^, (27.11) 

and assuming the improper prior P(A) = 1/A, make Laplace approxima- 
tions to the posterior distribution 

(a) over A 

(b) over log A. [Note the improper prior transforms to P(logA) = 
constant.] 

l> Exercise 27.2. ^ Use Laplace's method to approximate the integral 

/oo 
da/(ar(i-/(a))" 2 , (27.12) 
-oo 

where /(a) = 1/(1 + e~ a ) and Mi, U2 are positive. Check the accuracy of 
the approximation against the exact answer (23.29, p. 316) for (ui,W2) = 
(V2, V2) and (ui,U2) = (1,1). Measure the error (logZp — log Zq) in 
bits. 

t> Exercise 27.3. ^ Linear regression. N datapoints {(x^ n \ t^)} are generated by 
the experimenter choosing each x^ n \ then the world delivering a noisy 
version of the linear function 

y(x) = wo + w\x, (27.13) 

i<") ~ Norma%(2>>), a 2 v ). (27.14) 

Assuming Gaussian priors on wq and w\, make the Laplace approxima- 
tion to the posterior distribution of w$ and w\ (which is exact, in fact) 
and obtain the predictive distribution for the next datapoint t ( Ar+1 ) , given 

(See MacKay (1992a) for further reading.) 
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28 



Model Comparison and Occam's Razor 




Figure 28.1. 

interpreted, 
some boxes. 



A picture to be 

It contains a tree and 
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► 28.1 Occam's razor 

How many boxes are in the picture (figure 28.1)? In particular, how many 
boxes are in the vicinity of the tree? If we looked with x-ray spectacles, 
would we see one or two boxes behind the trunk (figure 28.2)? (Or even 
more?) Occam's razor is the principle that states a preference for simple 
theories. 'Accept the simplest explanation that fits the data'. Thus according 
to Occam's razor, we should deduce that there is only one box behind the tree. 
Is this an ad hoc rule of thumb? Or is there a convincing reason for believing 
there is most likely one box? Perhaps your intuition likes the argument 'well, 
it would be a remarkable coincidence for the two boxes to be just the same 
height and colour as each other'. If we wish to make artificial intelligences 
that interpret data correctly, we must translate this intuitive feeling into a 
concrete theory. 



2? 



Figure 28.2. How many boxes are 
behind the tree? 



Motivations for Occam's razor 

If several explanations are compatible with a set of observations, Occam's 
razor advises us to buy the simplest. This principle is often advocated for one 
of two reasons: the first is aesthetic ('A theory with mathematical beauty is 
more likely to be correct than an ugly one that fits some experimental data' 
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(Paul Dirac)); the second reason is the past empirical success of Occam's razor. 
However there is a different justification for Occam's razor, namely: 

Coherent inference (as embodied by Bayesian probability) auto- 
matically embodies Occam's razor, quantitatively. 

It is indeed more probable that there's one box behind the tree, and we can 
compute how much more probable one is than two. 

Model comparison and Occam's razor 

We evaluate the plausibility of two alternative theories Hi and H2 in the light 
of data D as follows: using Bayes' theorem, we relate the plausibility of model 
Hi given the data, P(Hi \D), to the predictions made by the model about 
the data, P(D\Hi), and the prior plausibility of Hi, P(Hi). This gives the 
following probability ratio between theory Hi and theory H2 '■ 



P(Hi\D) P{Hi)P(D\Hi) 



(28.1) 



P{H 2 \D) P(H 2 )P(D\H 2 )- 

The first ratio (P(Hi)/P(H2)) on the right-hand side measures how much our 
initial beliefs favoured Hi over Ti.2- The second ratio expresses how well the 
observed data were predicted by Hi, compared to H2- 

How does this relate to Occam's razor, when Hi is a simpler model than 
H2' The first ratio (P(Hi)/P(H2)) gives us the opportunity, if we wish, to 
insert a prior bias in favour of Hi on aesthetic grounds, or on the basis of 
experience. This would correspond to the aesthetic and empirical motivations 
for Occam's razor mentioned earlier. But such a prior bias is not necessary: 
the second ratio, the data-dependent factor, embodies Occam's razor auto- 
matically. Simple models tend to make precise predictions. Complex models, 
by their nature, are capable of making a greater variety of predictions (figure 
28.3). So if H2 is a more complex model, it must spread its predictive proba- 
bility P(D I H2) more thinly over the data space than Hi. Thus, in the case 
where the data are compatible with both theories, the simpler Hi will turn out 
more probable than H2, without our having to express any subjective dislike 
for complex models. Our subjective prior just needs to assign equal prior prob- 
abilities to the possibilities of simplicity and complexity. Probability theory 
then allows the observed data to express their opinion. 

Let us turn to a simple example. Here is a sequence of numbers: 

-1, 3, 7, 11. 



Figure 28.3. Why Bayesian 
inference embodies Occam's razor. 
This figure gives the basic 
intuition for why complex models 
can turn out to be less probable. 
The horizontal axis represents the 
space of possible data sets D. 
Bayes' theorem rewards models in 
proportion to how much they 
predicted the data that occurred. 
These predictions are quantified 
by a normalized probability 
distribution on D. This 
probability of the data given 
model Hi, P(D \ Hi), is called the 
evidence for Hi- 

A simple model Hi makes only a 
limited range of predictions, 
shown by P(D \ Hi); a more 
powerful model H2, that has, for 
example, more free parameters 
than Hi, is able to predict a 
greater variety of data sets. This 
means, however, that H2 does not 
predict the data sets in region C\ 
as strongly as Hi- Suppose that 
equal prior probabilities have been 
assigned to the two models. Then, 
if the data set falls in region Ci, 
the less powerful model Hi will be 
the more probable model. 



The task is to predict the next two numbers, and infer the underlying process 
that gave rise to this sequence. A popular answer to this question is the 
prediction '15, 19', with the explanation 'add 4 to the previous number'. 

What about the alternative answer ' — 19.9,1043.8' with the underlying 
rule being: 'get the next number from the previous number, x, by evaluating 
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—x s /ll + 9/llx 2 + 23/11'? I assume that this prediction seems rather less 
plausible. But the second rule fits the data (—1, 3, 7, 11) just as well as the 
rule 'add 4'. So why should we find it less plausible? Let us give labels to the 
two general theories: 

H a — the sequence is an arithmetic progression, 'add n\ where n is an integer. 

H c — the sequence is generated by a cubic function of the form x — > cx 3 + 
dx 2 + e, where c, d and e are fractions. 

One reason for finding the second explanation, H c , less plausible, might be 
that arithmetic progressions are more frequently encountered than cubic func- 
tions. This would put a bias in the prior probability ratio P(H a )/ P(H C ) in 
equation (28.1). But let us give the two theories equal prior probabilities, and 
concentrate on what the data have to say. How well did each theory predict 
the data? 

To obtain P(D | H a ) we must specify the probability distribution that each 
model assigns to its parameters. First, H a depends on the added integer n, 
and the first number in the sequence. Let us say that these numbers could 
each have been anywhere between —50 and 50. Then since only the pair of 
values {n = 4, first number= — 1} give rise to the observed data D = (—1, 3, 
7, 11), the probability of the data, given 1~L a , is: 

p(d | H a ) = Y^fYrJY = °- 0001 °- (28.2) 

To evaluate P(D\H C ), we must similarly say what values the fractions c, d 
and e might take on. [I choose to represent these numbers as fractions rather 
than real numbers because if we used real numbers, the model would assign, 
relative to H a , an infinitesimal probability to D. Real parameters are the 
norm however, and are assumed in the rest of this chapter.] A reasonable 
prior might state that for each fraction the numerator could be any number 
between —50 and 50, and the denominator is any number between 1 and 50. 
As for the initial value in the sequence, let us leave its probability distribution 
the same as in H a - There are four ways of expressing the fraction c = — 1/11 = 
—2/22 = —3/33 = —4/44 under this prior, and similarly there are four and two 
possible solutions for d and e, respectively. So the probability of the observed 
data, given H c , is found to be: 

P{D ^ = Got) (iot4) (mh) (i5i5o-) 

= 0.0000000000025 = 2.5 x 10~ 12 . (28.3) 

Thus comparing P(D \ H c ) with P(D \ H a ) = 0.00010, even if our prior prob- 
abilities for H a and H c are equal, the odds, P(D | H a ) '■ P(D \ H c ), in favour 
of H a over H c , given the sequence D = (—1, 3, 7, 11), are about forty million 
to one. □ 
This answer depends on several subjective assumptions; in particular, the 
probability assigned to the free parameters n, c, d, e of the theories. Bayesians 
make no apologies for this: there is no such thing as inference or prediction 
without assumptions. However, the quantitative details of the prior proba- 
bilities have no effect on the qualitative Occam's razor effect; the complex 
theory H c always suffers an 'Occam factor' because it has more parameters, 
and so can predict a greater variety of data sets (figure 28.3). This was only 
a small example, and there were only four data points; as we move to larger 
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and more sophisticated problems the magnitude of the Occam factors typi- 
cally increases, and the degree to which our inferences are influenced by the 
quantitative details of our subjective assumptions becomes smaller. 

Bayesian methods and data analysis 

Let us now relate the discussion above to real problems in data analysis. 

There are countless problems in science, statistics and technology which 
require that, given a limited data set, preferences be assigned to alternative 
models of differing complexities. For example, two alternative hypotheses 
accounting for planetary motion are Mr. Inquisition's geocentric model based 
on 'epicycles', and Mr. Copernicus's simpler model of the solar system with 
the sun at the centre. The epicyclic model fits data on planetary motion at 
least as well as the Copernican model, but does so using more parameters. 
Coincidentally for Mr. Inquisition, two of the extra epicyclic parameters for 
every planet are found to be identical to the period and radius of the sun's 
'cycle around the earth'. Intuitively we find Mr. Copernicus's theory more 
probable. 



Figure 28.4. Where Bayesian 
inference fits into the data 
modelling process. 
This figure illustrates an 
abstraction of the part of the 
scientific process in which data 
are collected and modelled. In 
particular, this figure applies to 
pattern classification, learning, 
interpolation, etc. The two 
double-framed boxes denote the 
two steps which involve inference. 
It is only in those two steps that 
Bayes' theorem can be used. 
Bayes does not tell you how to 
invent models, for example. 
The first box, 'fitting each model 
to the data', is the task of 
inferring what the model 
parameters might be given the 
model and the data. Bayesian 
methods may be used to find the 
most probable parameter values, 
and error bars on those 
parameters. The result of 
applying Bayesian methods to this 
problem is often little different 
from the answers given by 
orthodox statistics. 
The second inference task, model 
comparison in the light of the 
data, is where Bayesian methods 
are in a class of their own. This 
second inference problem requires 
a quantitative Occam's razor to 
penalize over-complex models. 
Bayesian methods can assign 
objective preferences to the 
alternative models in a way that 
automatically embodies Occam's 



The mechanism of the Bayesian razor: the evidence and the Occam factor 

Two levels of inference can often be distinguished in the process of data mod- 
elling. At the first level of inference, we assume that a particular model is true, 
and we fit that model to the data, i.e., we infer what values its free param- 
eters should plausibly take, given the data. The results of this inference are 
often summarized by the most probable parameter values, and error bars on 
those parameters. This analysis is repeated for each model. The second level 
of inference is the task of model comparison. Here we wish to compare the 
models in the light of the data, and assign some sort of preference or ranking 
to the alternatives. 

Note that both levels of inference arc distinct from decision theory. The goal 
of inference is, given a defined hypothesis space and a particular data set, to 
assign probabilities to hypotheses. Decision theory typically chooses between 
alternative actions on the basis of these probabilities so as to minimize the 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 

28.1: Occam's razor 

expectation of a 'loss function'. This chapter concerns inference alone and no 
loss functions are involved. When we discuss model comparison, this should 
not be construed as implying model choice. Ideal Bayesian predictions do not 
involve choice between models; rather, predictions are made by summing over 
all the alternative models, weighted by their probabilities. 

Bayesian methods are able consistently and quantitatively to solve both 
the inference tasks. There is a popular myth that states that Bayesian meth- 
ods only differ from orthodox statistical methods by the inclusion of subjective 
priors, which are difficult to assign, and which usually don't make much dif- 
ference to the conclusions. It is true that, at the first level of inference, a 
Bayesian's results will often differ little from the outcome of an orthodox at- 
tack. What is not widely appreciated is how a Bayesian performs the second 
level of inference; this chapter will therefore focus on Bayesian model compar- 
ison. 

Model comparison is a difficult task because it is not possible simply to 
choose the model that fits the data best: more complex models can always 
fit the data better, so the maximum likelihood model choice would lead us 
inevitably to implausible, over-parameterized models, which generalize poorly. 
Occam's razor is needed. 

Let us write down Bayes' theorem for the two levels of inference described 
above, so as to see explicitly how Bayesian model comparison works. Each 
model Hi is assumed to have a vector of parameters w. A model is defined 
by a collection of probability distributions: a 'prior' distribution P(w \ Hi), 
which states what values the model's parameters might be expected to take; 
and a set of conditional distributions, one for each value of w, defining the 
predictions P(D | w, Hi) that the model makes about the data D. 

1. Model fitting. At the first level of inference, we assume that one model, 
the ith, say, is true, and we infer what the model's parameters w might 
be, given the data D. Using Bayes' theorem, the posterior probability 
of the parameters w is: 

PM^ ^ff™ , (28.4) 

that is, 

Likelihood x Prior 

Posterior = — — . 

Evidence 

The normalizing constant P(D \ Hi) is commonly ignored since it is irrel- 
evant to the first level of inference, i.e., the inference of w; but it becomes 
important in the second level of inference, and we name it the evidence 
for Hi- It is common practice to use gradient-based methods to find the 
maximum of the posterior, which defines the most probable value for the 
parameters, w MP ; it is then usual to summarize the posterior distribution 
by the value of w MP , and error bars or confidence intervals on these best 
fit parameters. Error bars can be obtained from the curvature of the pos- 
terior; evaluating the Hessian at w MP , A = — VVlnP(w | D,Hi)\ WMp , 
and Taylor-expanding the log posterior probability with Aw 

P(w I D, Hi) ~ P(w MP | D, Hi) exp (-V2Aw T AAw) , (28.5) 

we see that the posterior can be locally approximated as a Gaussian 
with covariance matrix (equivalent to error bars) A -1 . [Whether this 
approximation is good or not will depend on the problem we are solv- 
ing. Indeed, the maximum and mean of the posterior distribution have 
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no fundamental status in Bayesian inference - they both change under 
nonlinear reparameterizations. Maximization of a posterior probabil- 
ity is only useful if an approximation like equation (28.5) gives a good 
summary of the distribution.] 



Figure 28.5. The Occam factor. 
This figure shows the quantities 
that determine the Occam factor 
for a hypothesis Hi having a 
single parameter w. The prior 
distribution (solid line) for the 
parameter has width cr w . The 
posterior distribution (dashed 
line) has a single peak at w MP 
with characteristic width o w \ D . 
The Occam factor is 



^Id^Kp I Hi) 



7 W \D 



2. Model comparison. At the second level of inference, we wish to infer 
which model is most plausible given the data. The posterior probability 
of each model is: 



p(n l \D)^p(D\n i )P(n i ). (28.6) 

Notice that the data-dependent term P(D \ TLi) is the evidence for Hi, 
which appeared as the normalizing constant in (28.4). The second term, 
P(Hi), is the subjective prior over our hypothesis space, which expresses 
how plausible we thought the alternative models were before the data 
arrived. Assuming that we choose to assign equal priors P{Hi) to the 
alternative models, models Hi are ranked by evaluating the evidence. The 
normalizing constant P(D) = P(D \ Hi)P(Hi) has been omitted from 
equation (28.6) because in the data-modelling process we may develop 
new models after the data have arrived, when an inadequacy of the first 
models is detected, for example. Inference is open ended: we continually 
seek more probable models to account for the data we gather. 

To repeat the key idea: to rank alternative models Hi, a Bayesian eval- 
uates the evidence P(D\Hi). This concept is very general: the ev- 
idence can be evaluated for parametric and 'non-parametric' models 
alike; whatever our data-modelling task, a regression problem, a clas- 
sification problem, or a density estimation problem, the evidence is a 
transportable quantity for comparing alternative models. In all these 
cases the evidence naturally embodies Occam's razor. 



Evaluating the evidence 

Let us now study the evidence more closely to gain insight into how the 
Bayesian Occam's razor works. The evidence is the normalizing constant for 
equation (28.4): 

P(D \7U) = J P(D | w, Hi)P(w | Hi) dw. (28.7) 

For many problems the posterior P(w | D,Hi) oc P(D | w, Hi)P(w \ Hi) has 
a strong peak at the most probable parameters w MP (figure 28.5). Then, 
taking for simplicity the one-dimensional case, the evidence can be approx- 
imated, using Laplace's method, by the height of the peak of the integrand 
P(D | w, Hi)P(w | Hi) times its width, cr w \ D : 
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P{D\Hi) ~ P(£)|wmp,W<) xP{w MP \Hi)a wlD . (28.8) 

' " ' * » 

Evidence ~ Best fit likelihood x Occam factor 

Thus the evidence is found by taking the best fit likelihood that the model 
can achieve and multiplying it by an 'Occam factor', which is a term with 
magnitude less than one that penalizes Hi for having the parameter w. 

Interpretation of the Occam factor 

The quantity <J w \d is the posterior uncertainty in w. Suppose for simplicity 
that the prior P(w | Hi) is uniform on some large interval a w , representing the 
range of values of w that were possible a priori, according to Hi (figure 28.5). 
Then P(w MP | Hi) = 1/<J W , and 

^w\D 

Occam factor = , (28.9) 

i.e., the Occam factor is equal to the ratio of the posterior accessible volume 
of Hi 's parameter space to the prior accessible volume, or the factor by which 
Hi's hypothesis space collapses when the data arrive. The model Hi can be 
viewed as consisting of a certain number of exclusive submodels, of which only 
one survives when the data arrive. The Occam factor is the inverse of that 
number. The logarithm of the Occam factor is a measure of the amount of 
information we gain about the model's parameters when the data arrive. 

A complex model having many parameters, each of which is free to vary 
over a large range a w , will typically be penalized by a stronger Occam factor 
than a simpler model. The Occam factor also penalizes models that have to 
be finely tuned to fit the data, favouring models for which the required pre- 
cision of the parameters o w \q is coarse. The magnitude of the Occam factor 
is thus a measure of complexity of the model; it relates to the complexity of 
the predictions that the model makes in data space. This depends not only 
on the number of parameters in the model, but also on the prior probability 
that the model assigns to them. Which model achieves the greatest evidence 
is determined by a trade-off between minimizing this natural complexity mea- 
sure and minimizing the data misfit. In contrast to alternative measures of 
model complexity, the Occam factor for a model is straightforward to evalu- 
ate: it simply depends on the error bars on the parameters, which we already 
evaluated when fitting the model to the data. 

Figure 28.6 displays an entire hypothesis space so as to illustrate the var- 
ious probabilities in the analysis. There are three models, Hi,H2,Hs, which 
have equal prior probabilities. Each model has one parameter w (each shown 
on a horizontal axis), but assigns a different prior range a w to that parame- 
ter. Hz is the most 'flexible' or 'complex' model, assigning the broadest prior 
range. A one-dimensional data space is shown by the vertical axis. Each 
model assigns a joint probability distribution P(D ,w\Hi) to the data and 
the parameters, illustrated by a cloud of dots. These dots represent random 
samples from the full probability distribution. The total number of dots in 
each of the three model subspaces is the same, because we assigned equal prior 
probabilities to the models. 

When a particular data set D is received (horizontal line), we infer the pos- 
terior distribution of w for a model {H3, say) by reading out the density along 
that horizontal line, and normalizing. The posterior probability P(w | D, H3) 
is shown by the dotted curve at the bottom. Also shown is the prior distribu- 
tion P(w I H3) (cf. figure 28.5). [In the case of model H\ which is very poorly 
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Figure 28.6. A hypothesis space 
consisting of three exclusive 
models, each having one 
parameter w, and a 
one-dimensional data set D. The 
'data set' is a single measured 
value which differs from the 
parameter w by a small amount 
of additive noise. Typical samples 
from the joint distribution 
P(D,w,H) are shown by dots. 
(N.B., these are not data points.) 
The observed 'data set' is a single 
particular value for D shown by 
the dashed horizontal line. The 
dashed curves below show the 
posterior probability of w for each 
model given this data set (cf. 
figure 28.3). The evidence for the 
different models is obtained by 
marginalizing onto the D axis at 
the left-hand side (cf. figure 28.5). 



matched to the data, the shape of the posterior distribution will depend on 
the details of the tails of the prior P(w | H\) and the likelihood P(D \ -w,H\); 
the curve shown is for the case where the prior falls off more strongly.] 

We obtain figure 28.3 by marginalizing the joint distributions P(D, w | TLi) 
onto the D axis at the left-hand side. For the data set D shown by the dotted 
horizontal line, the evidence P(D \ 7i^) for the more flexible model 7i% has 
a smaller value than the evidence for H2- This is because H3 placed less 
predictive probability (fewer dots) on that line. In terms of the distributions 
over w, model TL3 has smaller evidence because the Occam factor o w \ D ja w is 
smaller for H3 than for H.2- The simplest model TL\ has the smallest evidence 
of all, because the best fit that it can achieve to the data D is very poor. 
Given this data set, the most probable model is 7^2- 

Occam factor for several parameters 

If the posterior is well approximated by a Gaussian, then the Occam factor 
is obtained from the determinant of the corresponding covariance matrix (cf. 
equation (28.8) and Chapter 27): 

P(D\Hi) ~ P{D I w M p, Hi) x P(w MP I Hi) det~5(A/27r), (28.10) 

Evidence ~ Best fit likelihood x Occam factor 

where A = — VV In P(w | D, 7i.i), the Hessian which we evaluated when we 
calculated the error bars on w MP (equation 28.5 and Chapter 27). As the 
amount of data collected increases, this Gaussian approximation is expected 
to become increasingly accurate. 

In summary, Bayesian model comparison is a simple extension of maximum 
likelihood model selection: the evidence is obtained by multiplying the best fit 
likelihood by the Occam, factor. 

To evaluate the Occam factor we need only the Hessian A, if the Gaussian 
approximation is good. Thus the Bayesian method of model comparison by 
evaluating the evidence is no more computationally demanding than the task 
of finding for each model the best fit parameters and their error bars. 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 



28.2: Example 



351 



► 28.2 Example 

Let's return to the example that opened this chapter. Are there one or two 
boxes behind the tree in figure 28.1? Why do coincidences make us suspicious? 

Let's assume the image of the area round the trunk and box has a size 
of 50 pixels, that the trunk is 10 pixels wide, and that 16 different colours of 
boxes can be distinguished. The theory 7i\ that says there is one box near 
the trunk has four free parameters: three coordinates defining the top three 
edges of the box, and one parameter giving the box's colour. (If boxes could 
levitate, there would be five free parameters.) 

The theory Ti.2 that says there are two boxes near the trunk has eight free 
parameters (twice four), plus a ninth, a binary variable that indicates which 
of the two boxes is the closest to the viewer. 

What is the evidence for each model? We'll do TL\ first. We need a prior on 
the parameters to evaluate the evidence. For convenience, let's work in pixels. 
Let's assign a separable prior to the horizontal location of the box, its width, 
its height, and its colour. The height could have any of, say, 20 distinguishable 
values, so could the width, and so could the location. The colour could have 
any of 16 values. We'll put uniform priors over these variables. We'll ignore 
all the parameters associated with other objects in the image, since they don't 
come into the model comparison between TL\ and 7^2- The evidence is 



P{D\H X ) 



1111 
20 20 20 16 



(28.11) 



since only one setting of the parameters fits the data, and it predicts the data 
perfectly. 

As for model 7^2, six of its nine parameters are well-determined, and three 
of them are partly-constrained by the data. If the left-hand box is furthest 
away, for example, then its width is at least 8 pixels and at most 30; if it's 
the closer of the two boxes, then its width is between 8 and 18 pixels. (I'm 
assuming here that the visible portion of the left-hand box is about 8 pixels 
wide.) To get the evidence we need to sum up the prior probabilities of all 
viable hypotheses. To do an exact calculation, we need to be more specific 
about the data and the priors, but let's just get the ballpark answer, assuming 
that the two unconstrained real variables have half their values available, and 
that the binary variable is completely undetermined. (As an exercise, you can 
make an explicit model and work out the exact answer.) 



P(D\H 2 )^ . (28.12 

v 1 ; 20 20 20 16 20 20 20 16 2 v 

Thus the posterior probability ratio is (assuming equal prior probability): 
P(D\Hi)P(Hi) 
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(28.13) 
(28.14) 



So the data are roughly 1000 to 1 in favour of the simpler hypothesis. The 
four factors in (28.13) can be interpreted in terms of Occam factors. The more 
complex model has four extra parameters for sizes and colours - three for sizes, 
and one for colour. It has to pay two big Occam factors (V20 and Vl6) for the 
highly suspicious coincidences that the two box heights match exactly and the 
two colours match exactly; and it also pays two lesser Occam factors for the 
two lesser coincidences that both boxes happened to have one of their edges 
conveniently hidden behind a tree or behind each other. 



1? 



or 2? 



Figure 28.7. How many boxes are 
behind the tree? 
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Hi: 




L(w* 1) |W 1 ) 


L(D\ W * (1) ,H 1 ) 


H 2 : 


L(H 2 ) 


L(w* 2) \H 2 ) 


L(D\v,* {2) ,H 2 ) 




H 3 : 


L(H 3 ) 


L(w* (3) \H 3 ) 


L(D\w* 3) ,H 3 ) 



► 28.3 Minimum description length (MDL) 

A complementary view of Bayesian model comparison is obtained by replacing 
probabilities of events by the lengths in bits of messages that communicate 
the events without loss to a receiver. Message lengths L(x) correspond to a 
probabilistic model over events x via the relations: 

2 -L(x) 



P(x) 



L(x) 



log 2 P(x) 



(28.15) 



The MDL principle (Wallace and Boulton, 1968) states that one should 
prefer models that can communicate the data in the smallest number of bits. 
Consider a two-part message that states which model, H, is to be used, and 
then communicates the data D within that model, to some pre-arranged pre- 
cision SD. This produces a message of length L(D,H) = L(H) + L(D \H). 
The lengths L(7{) for different TL define an implicit prior P(TL) over the alter- 
native models. Similarly L(D \ Ti) corresponds to a density P(D \ Ti). Thus, a 
procedure for assigning message lengths can be mapped onto posterior prob- 
abilities: 

L(D,TL) = -logP(W) -log{P{D\H)SD) (28.16) 
= - log P(W | L>) + const. (28.17) 

In principle, then, MDL can always be interpreted as Bayesian model compar- 
ison and vice versa. However, this simple discussion has not addressed how 
one would actually evaluate the key data-dependent term L(D\H), which 
corresponds to the evidence for Ti. Often, this message is imagined as being 
subdivided into a parameter block and a data block (figure 28.8). Models with 
a small number of parameters have only a short parameter block but do not 
fit the data well, and so the data message (a list of large residuals) is long. As 
the number of parameters increases, the parameter block lengthens, and the 
data message becomes shorter. There is an optimum model complexity (Ti.2 
in the figure) for which the sum is minimized. 

This picture glosses over some subtle issues. We have not specified the 
precision to which the parameters w should be sent. This precision has an 
important effect (unlike the precision SD to which real-valued data D are 
sent, which, assuming SD is small relative to the noise level, just introduces 
an additive constant). As we decrease the precision to which w is sent, the 
parameter message shortens, but the data message typically lengthens because 
the truncated parameters do not match the data so well. There is a non-trivial 
optimal precision. In simple Gaussian cases it is possible to solve for this 
optimal precision (Wallace and Freeman, 1987), and it is closely related to the 
posterior error bars on the parameters, A -1 , where A = — VVlnP(w | D,Ti). 
It turns out that the optimal parameter message length is virtually identical to 
the log of the Occam factor in equation (28.10). (The random element involved 
in parameter truncation means that the encoding is slightly sub-optimal.) 

With care, therefore, one can replicate Bayesian results in MDL terms. 
Although some of the earliest work on complex model comparison involved 
the MDL framework (Patrick and Wallace, 1982), MDL has no apparent ad- 
vantages over the direct probabilistic approach. 



Figure 28.8. A popular view of 
model comparison by minimum 
description length. Each model 
Hi communicates the data D by 
sending the identity of the model, 
sending the best fit parameters of 
the model w*, then sending the 
data relative to those parameters. 
As we proceed to more complex 
models the length of the 
parameter message increases. On 
the other hand, the length of the 
data message decreases, because a 
complex model is able to fit the 
data better, making the residuals 
smaller. In this example the 
intermediate model H 2 achieves 
the optimum trade-off between 
these two trends. 
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MDL does have its uses as a pedagogical tool. The description length 
concept is useful for motivating prior probability distributions. Also, different 
ways of breaking down the task of communicating data using a model can give 
helpful insights into the modelling process, as will now be illustrated. 



On-line learning and cross-validation. 

In cases where the data consist of a sequence of points D = t^',t^- 2 ', . . . ,t^ N \ 
the log evidence can be decomposed as a sum of 'on-line' predictive perfor- 
mances: 

log P(D\H) = logP(t« |W)+logP(t( 2 )|tW,W) 

+ log P(t< 3 > 1 t« , t< 2 > , H) + ■ ■ ■ + log P{t (N) 1 1« . . . t^" 1 ' , H) . (28.18) 

This decomposition can be used to explain the difference between the ev- 
idence and 'leave-one-out cross-validation' as measures of predictive abil- 
ity. Cross-validation examines the average value of just the last term, 
log.P(t( JV) 1 . . .t^^^Ti), under random re-orderings of the data. The evi- 
dence, on the other hand, sums up how well the model predicted all the data, 
starting from scratch. 



The 'bits-back' encoding method. 

Another MDL thought experiment (Hinton and van Camp, 1993) involves in- 
corporating random bits into our message. The data are communicated using a 
parameter block and a data block. The parameter vector sent is a random sam- 
ple from the posterior, P{w\D,H) = P(D\w,H)P(w\H)/P(D\H). This 
sample w is sent to an arbitrary small granularity <5w using a message length 
L(w\H) = —log[P(w\H)6w]. The data are encoded relative to w with a 
message of length L(D \ w,H) = — log[P(P | w, H)SD}. Once the data mes- 
sage has been received, the random bits used to generate the sample w from 
the posterior can be deduced by the receiver. The number of bits so recov- 
ered is — log[P(w | D,H)5w}. These recovered bits need not count towards the 
message length, since we might use some other optimally-encoded message as 
a random bit string, thereby communicating that message at the same time. 
The net description cost is therefore: 

L(w|W) + L(J>|w,W)- 'Bits back' = -log ^ W '^|^^ 

= -log P(D\H) -logSD. (28.19) 

Thus this thought experiment has yielded the optimal description length. Bits- 
back encoding has been turned into a practical compression method for data 
modelled with latent variable models by Frey (1998). 



Further reading 

Bayesian methods are introduced and contrasted with sampling-theory statis- 
tics in (Jaynes, 1983; Gull, 1988; Loredo, 1990). The Bayesian Occam's razor 
is demonstrated on model problems in (Gull, 1988; MacKay, 1992a). Useful 
textbooks are (Box and Tiao, 1973; Berger, 1985). 

One debate worth understanding is the question of whether it's permis- 
sible to use improper priors in Bayesian inference (Dawid et ai, 1996). If 
we want to do model comparison (as discussed in this chapter), it is essen- 
tial to use proper priors - otherwise the evidences and the Occam factors are 
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meaningless. Only when one has no intention to do model comparison may 
it be safe to use improper priors, and even in such cases there are pitfalls, as 
Dawid et al. explain. I would agree with their advice to always use proper 
priors, tempered by an encouragement to be smart when making calculations, 
recognizing opportunities for approximation. 

► 28.4 Exercises 

Exercise 28.1. ^ Random variables x come independently from a probability 
distribution P(x). According to model Ho, P(x) is a uniform distribu- 
tion 

P{x\H Q ) = \ ie(-l,l). (28.20) 

According to model Hi, P{x) is a nonuniform distribution with an un- 
known parameter m € ( — 1,1): 

P(x\m,Hi) = -(1 + mx) x € (-1,1). (28.21) 

Given the data D = {0.3,0.5,0.7,0.8,0.9}, what is the evidence for Ho 
and Hi? 

Exercise 28.2. Datapoints (x,t) are believed to come from a straight line. 

The experimenter chooses x, and t is Gaussian-distributed about y = Wq + WlX 

y = w + u>ix (28.22) 

with variance cr^. According to model Hi, the straight line is horizontal, 
so wi = 0. According to model Hi, w\ is a parameter with prior distribu- 
tion Normal(0, 1). Both models assign a prior distribution Normal(0, 1) 
to Wq- Given the data set D = {(—8,8), (—2, 10), (6, 11)}, and assuming 
the noise level is a v = 1, what is the evidence for each model? 

Exercise 28.3. ^ A six-sided die is rolled 30 times and the numbers of times 
each face came up were F = {3,3,2,2,9,11}. What is the probability 
that the die is a perfectly fair die ('7irj'X assuming the alternative hy- 
pothesis Hi says that the die has a biased distribution p, and the prior 
density for p is uniform over the simplex pi > 0, J2iPi = 1? 

Solve this problem two ways: exactly, using the helpful Dirichlet formu- 
lae (23.30, 23.31), and approximately, using Laplace's method. Notice 
that your choice of basis for the Laplace approximation is important. 
See MacKay (1998a) for discussion of this exercise. 

Exercise 28.4. ^ The influence of race on the imposition of the death penalty 
for murder in America has been much studied. The following three-way 
table classifies 326 cases in which the defendant was convicted of mur- 
der. The three variables are the defendant's race, the victim's race, and 
whether the defendant was sentenced to death. (Data from M. Radelet, 
'Racial characteristics and imposition of the death penalty' American 
Sociological Review, 46 (1981), pp. 918-927.) 

White defendant Black defendant 



White victim 19 132 White victim 11 52 

Black victim 9 Black victim 6 97 
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It seems that the death penalty was applied much more often when the 
victim was white then when the victim was black. When the victim was 
white 14% of defendants got the death penalty, but when the victim was 
black 6% of defendants got the death penalty. [Incidentally these data 
provide an example of a phenomenon known as Simpson's paradox: a 
higher fraction of white defendants are sentenced to death overall, but 
in cases involving black victims a higher fraction of black defendants are 
sentenced to death and in cases involving white victims a higher fraction 
of black defendants are sentenced to death.] 

Quantify the evidence for the four alternative hypotheses shown in fig- 
ure 28.9. I should mention that I don't believe any of these models is 
adequate: several additional variables are important in murder cases, 
such as whether the victim and murderer knew each other, whether the 
murder was premeditated, and whether the defendant had a prior crim- 
inal record; none of these variables is included in the table. So this is 
an academic exercise in model comparison rather than a serious study 
of racial bias in the state of Florida. 

The hypotheses are shown as graphical models, with arrows showing 
dependencies between the variables v (victim race), m (murderer race), 
and d (whether death penalty given). Model Hoo has only one free 
parameter, the probability of receiving the death penalty; model Hn has 
four such parameters, one for each state of the variables v and m. Assign 
uniform priors to these variables. How sensitive are the conclusions to 
the choice of prior? 







Figure 28.9. Four hypotheses 
concerning the dependence of the 
imposition of the death penalty d 
on the race of the victim v and 
the race of the convicted murderer 
rn. Hoi, for example, asserts that 
the probability of receiving the 
death penalty does depend on the 
murderer's race, but not on the 
victim's. 
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About Chapter 29 



The last couple of chapters have assumed that a Gaussian approximation to 
the probability distribution we are interested in is adequate. What if it is not? 
We have already seen an example - clustering - where the likelihood function 
is multimodal, and has nasty unboundedly- high spikes in certain locations in 
the parameter space; so maximizing the posterior probability and fitting a 
Gaussian is not always going to work. This difficulty with Laplace's method is 
one motivation for being interested in Monte Carlo methods. In fact, Monte 
Carlo methods provide a general- purpose set of tools with applications in 
Bayesian data modelling and many other fields. 

This chapter describes a sequence of methods: importance sampling, re- 
jection sampling, the Metropolis method, Gibbs sampling and slice sampling. 
For each method, we discuss whether the method is expected to be useful for 
high-dimensional problems such as arise in inference with graphical models. 
[A graphical model is a probabilistic model in which dependencies and inde- 
pendencies of variables are represented by edges in a graph whose nodes are 
the variables.] Along the way, the terminology of Markov chain Monte Carlo 
methods is presented. The subsequent chapter gives a discussion of advanced 
methods for reducing random walk behaviour. 

For details of Monte Carlo methods, theorems and proofs and a full list 
of references, the reader is directed to Neal (1993b), Gilks et al. (1996), and 
Tanner (1996). 

In this chapter I will use the word 'sample' in the following sense: a sample 
from a distribution P(x) is a single realization x whose probability distribution 
is -P(x). This contrasts with the alternative usage in statistics, where 'sample' 
refers to a collection of realizations {x}. 

When we discuss transition probability matrices, I will use a right-multipli- 
cation convention: I like my matrices to act to the right, preferring 

u = Mv (29.1) 

to 

u T = v T M T . (29.2) 

A transition probability matrix Ty or T^j specifies the probability, given the 
current state is j, of making the transition from j to i. The columns of T are 
probability vectors. If we write down a transition probability density, we use 
the same convention for the order of its arguments: T(x'; x) is a transition 
probability density from x to x' . This unfortunately means that you have 
to get used to reading from right to left - the sequence xyz has probability 
T(z; y)T(y;x)7T(x). 
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29 



Monte Carlo Methods 



► 29.1 The problems to be solved 

Monte Carlo methods are computational techniques that make use of random 
numbers. The aims of Monte Carlo methods are to solve one or both of the 
following problems. 

Problem 1: to generate samples {x.^}^ =1 from a given probability distribu- 
tion P(x). 

Problem 2: to estimate expectations of functions under this distribution, for 
example 

$ = (<£(x)) = J d w x P(x)<£(x). (29.3) 

The probability distribution P(x), which we call the target density, might 
be a distribution from statistical physics or a conditional distribution arising 
in data modelling - for example, the posterior probability of a model's pa- 
rameters given some observed data. We will generally assume that x is an 
iV-dimensional vector with real components x n , but we will sometimes con- 
sider discrete spaces also. 

Simple examples of functions </>(x) whose expectations we might be inter- 
ested in include the first and second moments of quantities that we wish to 
predict, from which we can compute means and variances; for example if some 
quantity t depends on x, we can find the mean and variance of t under P(x) 
by finding the expectations of the functions 0i(x) = t(x) and </>2( x ) = (^(x)) 2 , 

$! = £[</>!(*)] and * 2 = £[02(x)], (29.4) 

then using 

t = $i and var(t) = <J> 2 - (29.5) 

It is assumed that P(x) is sufficiently complex that we cannot evaluate these 
expectations by exact methods; so we are interested in Monte Carlo methods. 

We will concentrate on the first problem (sampling), because if we have 
solved it, then we can solve the second problem by using the random samples 
{x^)}^ to give the estimator 

*^I>(* (P) )- ( 29 - 6 ) 

r 

If the vectors {x^j^Lj are generated from P(x) then the expectation of $ is 
<E>. Also, as the number of samples R increases, the variance of $ will decrease 
as a /r, where a 2 is the variance of (f>, 

a 2 = J d w x P(x)(^(x) - $) 2 . (29.7) 



357 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 

358 29 — Monte Carlo Methods 




Figure 29.1. (a) The function 
P*(x) = 

exp[0.4(x - 0.4) 2 - 0.08a; 4 ]. How 
to draw samples from this 
density? (b) The function P*(x) 
evaluated at a discrete set of 
uniformly spaced points {xi}. 
How to draw samples from this 
discrete distribution? 



This is one of the important properties of Monte Carlo methods. 



The accuracy of the Monte Carlo estimate (29.6) depends only on 
the variance of <f>, not on the dimensionality of the space sampled. 
To be precise, the variance of $ goes as <j 2 /R. So regardless of the 
dimensionality of x, it may be that as few as a dozen independent 
samples {x^} suffice to estimate $ satisfactorily. 



We will find later, however, that high dimensionality can cause other diffi- 
culties for Monte Carlo methods. Obtaining independent samples from a given 
distribution -P(x) is often not easy. 



Why is sampling from -P(x) hard? 

We will assume that the density from which we wish to draw samples, -P(x), 
can be evaluated, at least to within a multiplicative constant; that is, we can 
evaluate a function P*(x) such that 

P(x) = P*(x)/Z. (29.8) 

If we can evaluate P*(x), why can we not easily solve problem 1? Why is it in 
general difficult to obtain samples from P(x)? There are two difficulties. The 
first is that we typically do not know the normalizing constant 




(29.9) 



The second is that, even if we did know Z, the problem of drawing samples 
from P(x) is still a challenging one, especially in high-dimensional spaces, 
because there is no obvious way to sample from P without enumerating most 
or all of the possible states. Correct samples from P will by definition tend 
to come from places in x-space where P(x) is big; how can we identify those 
places where P(x) is big, without evaluating P(x) everywhere! There are only 
a few high-dimensional densities from which it is easy to draw samples, for 
example the Gaussian distribution. 

Let us start with a simple one-dimensional example. Imagine that we wish 
to draw samples from the density P(x) = P*(x)/Z where 

P*(x) = exp [0.4(a; - 0.4) 2 - 0.08x 4 ] , x e (-oo, oo). (29.10) 

We can plot this function (figure 29.1a). But that does not mean we can draw 
samples from it. To start with, we don't know the normalizing constant Z. 
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To give ourselves a simpler problem, we could discretize the variable x and 
ask for samples from the discrete probability distribution over a finite set of 
uniformly spaced points {xi} (figure 29.1b). How could we solve this problem? 
If we evaluate p* = P*{xi) at each point x^ we can compute 



(29.11) 



and 



Pi = P*/Z 



(29.12) 



and we can then sample from the probability distribution {pi} using various 
methods based on a source of random bits (see section 6.3). But what is the 
cost of this procedure, and how does it scale with the dimensionality of the 
space, N7 Let us concentrate on the initial cost of evaluating Z (29.11). To 
compute Z we have to visit every point in the space. In figure 29.1b there are 
50 uniformly spaced points in one dimension. If our system had N dimensions, 
N = 1000 say, then the corresponding number of points would be 50 1000 , an 
unimaginable number of evaluations of P* . Even if each component x n took 
only two discrete values, the number of evaluations of P* would be 2 1000 , a 
number that is still horribly huge. If every electron in the universe (there are 
about 2 266 of them) were a 1000 gigahertz computer that could evaluate P* 
for a trillion (2 40 ) states every second, and if we ran those 2 266 computers for 
a time equal to the age of the universe (2 58 seconds), they would still only 
visit 2 364 states. We'd have to wait for more than 2 636 ~ 10 190 universe ages 
to elapse before all 2 1000 states had been visited. 

Systems with 2 1000 states are two a penny* One example is a collection 
of 1000 spins such as a 30 x 30 fragment of an Ising model whose probability 
distribution is proportional to 



P*(x) = exp[-/3£(x) 



where x n € {±1} and 



E(x) 



m,n 



(29.13) 



(29.14) 



The energy function E(x) is readily evaluated for any x. But if we wish to 
evaluate this function at all states x, the computer time required would be 
2 1000 function evaluations. 

The Ising model is a simple model which has been around for a long time, 
but the task of generating samples from the distribution P(x) = P*(x)/Z is 
still an active research area; the first 'exact' samples from this distribution 
were created in the pioneering work of Propp and Wilson (1996), as we'll 
describe in Chapter 32. 



* Translation for American 
readers: 'such systems arc a dime 
a dozen'; incidentally, this 
equivalence (10c = 6p) shows that 
the correct exchange rate between 
our currencies is £1.00 = $1.67. 




P*(x) 



A useful analogy 

Imagine the tasks of drawing random water samples from a lake and finding 
the average plankton concentration (figure 29.2). The depth of the lake at 
x = (x,y) is -P*(x), and we assert (in order to make the analogy work) that 
the plankton concentration is a function of x, </>(x). The required average 
concentration is an integral like (29.3), namely 



<D = (0(x)> 



P*(x)0(x), 



(29.15) 



Figure 29.2. A lake whose depth 
at x = (x,y) is P*(x). 
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where Z = Jdxdy -P*(x) is the volume of the lake. You are provided with a 
boat, a satellite navigation system, and a plumbline. Using the navigator, you 
can take your boat to any desired location x on the map; using the plumbline 
you can measure -P*(x) at that point. You can also measure the plankton 
concentration there. 

Problem 1 is to draw 1 cm 3 water samples at random from the lake, in 
such a way that each sample is equally likely to come from any point within 
the lake. Problem 2 is to find the average plankton concentration. 

These are difficult problems to solve because at the outset we know nothing 
about the depth P* (x) . Perhaps much of the volume of the lake is contained 
in narrow, deep underwater canyons (figure 29.3), in which case, to correctly 
sample from the lake and correctly estimate $ our method must implicitly 
discover the canyons and find their volume relative to the rest of the lake. 
Difficult problems, yes; nevertheless, we'll see that clever Monte Carlo methods 
can solve them. 




Figure 29.3. A slice through a lake 
that includes some canyons. 



Uniform sampling 

Having accepted that we cannot exhaustively visit every location x in the 
state space, we might consider trying to solve the second problem (estimating 
the expectation of a function 4>{x)) by drawing random samples {x' r '}^_j 
uniformly from the state space and evaluating -P*(x) at those points. Then 
we could introduce a normalizing constant Zr, defined by 

R 

Z R = Y J P\^ (r) ), (29-16) 

r=l 

and estimate $ = f d x </>(x)P(x) by 

* = f>(* W )^^- (29.17) 



r=l 



Is anything wrong with this strategy? Well, it depends on the functions 0(x) 
and P *(x). Let us assume that 0(x) is a benign, smoothly varying function 
and concentrate on the nature of -P*(x). As we learnt in Chapter 4, a high- 
dimensional distribution is often concentrated in a small region of the state 
space known as its typical set T, whose volume is given by \T\ ~ 2 H W, where 
-ff(X) is the entropy of the probability distribution -P(x). If almost all the 
probability mass is located in the typical set and </>(x) is a benign function, 
the value of $ = f d x cf>(x.)P(x.) will be principally determined by the values 
that 0(x) takes on in the typical set. So uniform sampling will only stand 
a chance of giving a good estimate of $ if we make the number of samples 
R sufficiently large that we are likely to hit the typical set at least once or 
twice. So, how many samples are required? Let us take the case of the Ising 
model again. (Strictly, the Ising model may not be a good example, since it 
doesn't necessarily have a typical set, as defined in Chapter 4; the definition 
of a typical set was that all states had log probability close to the entropy, 
which for an Ising model would mean that the energy is very close to the 
mean energy; but in the vicinity of phase transitions, the variance of energy, 
also known as the heat capacity, may diverge, which means that the energy 
of a random state is not necessarily expected to be very close to the mean 
energy.) The total size of the state space is 2 N states, and the typical set has 
size 2 H . So each sample has a chance of 2 H /2 N of falling in the typical set. 
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64 log(2) 



(a) 





Figure 29.4. (a) Entropy of a 
64-spin Ising model as a function 
of temperature, (b) One state of a 
1024-spin Ising model. 



Temperature 



(b) 



The number of samples required to hit the typical set once is thus of order 

^-2*-*. (29.18) 

So, what is HI At high temperatures, the probability distribution of an Ising 
model tends to a uniform distribution and the entropy tends to i? max = N 
bits, which means i? m ; n is of order 1. Under these conditions, uniform sampling 
may well be a satisfactory technique for estimating $. But high temperatures 
are not of great interest. Considerably more interesting are intermediate tem- 
peratures such as the critical temperature at which the Ising model melts from 
an ordered phase to a disordered phase. The critical temperature of an infinite 
Ising model, at which it melts, is 9 C = 2.27. At this temperature the entropy 
of an Ising model is roughly N/2 bits (figure 29.4). For this probability dis- 
tribution the number of samples required simply to hit the typical set once is 
of order 

R mm ~2 N - N / 2 = 2 N / 2 , (29.19) 

which for N = 1000 is about 10 150 . This is roughly the square of the number 
of particles in the universe. Thus uniform sampling is utterly useless for the 
study of Ising models of modest size. And in most high-dimensional problems, 
if the distribution P(x) is not actually uniform, uniform sampling is unlikely 
to be useful. 



Overview 

Having established that drawing samples from a high-dimensional distribution 
P(x) = P*(x)/Z is difficult even if P*(x) is easy to evaluate, we will now 
study a sequence of more sophisticated Monte Carlo methods: importance 
sampling, rejection sampling, the Metropolis method, Gibhs sampling, and 
siice sampling. 

► 29.2 Importance sampling 

Importance sampling is not a method for generating samples from P(x) (prob- 
lem 1); it is just a method for estimating the expectation of a function c/>(x) 
(problem 2). It can be viewed as a generalization of the uniform sampling 
method. 

For illustrative purposes, let us imagine that the target distribution is a 
one-dimensional density P(x). Let us assume that we are able to evaluate this 
density at any chosen point x, at least to within a multiplicative constant; 
thus we can evaluate a function P*(x) such that 



P{x) = P*(x)/Z. 



(29.20) 
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But P(x) is too complicated a function for us to be able to sample from it 
directly. We now assume that we have a simpler density Q(x) from which we 
can generate samples and which we can evaluate to within a multiplicative 
constant (that is, we can evaluate Q*(x), where Q(x) = Q*{x)/Zq). An 
example of the functions P* , Q* and <fi is shown in figure 29.5. We call Q the 
sampler density. 

In importance sampling, we generate R samples {x^}f =1 from Q(x). If 
these points were samples from P(x) then we could estimate $ by equa- 
tion (29.6). But when we generate samples from Q, values of x where Q(x) is 
greater than P(x) will be over-represented in this estimator, and points where 
Q(x) is less than P(x) will be under-represented. To take into account the 
fact that we have sampled from the wrong distribution, we introduce weights 



w r 



P*(z (r) ) 
Q*(x(*)) 



(29.21) 



which we use to adjust the 'importance' of each point in our estimator thus: 



(29.22) 



t> bxercise 

29.1. [2, p.384] prove that; if 

is non-zero for all x where P{x) is 
non-zero, the estimator <3f> converges to $, the mean value of <f>(x), as R 
increases. What is the variance of this estimator, asymptotically? Hint: 
consider the statistics of the numerator and the denominator separately. 
Is the estimator $ an unbiased estimator for small 7?? 

A practical difficulty with importance sampling is that it is hard to estimate 
how reliable the estimator $ is. The variance of the estimator is unknown 
beforehand, because it depends on an integral over £ of a function involving 
P*(x). And the variance of $ is hard to estimate, because the empirical 
variances of the quantities w r and w r <j){x^) are not necessarily a good guide 
to the true variances of the numerator and denominator in equation (29.22). 
If the proposal density Q{x) is small in a region where \4>(x)P*(x)\ is large 
then it is quite possible, even after many points x^ r ' have been generated, that 
none of them will have fallen in that region. In this case the estimate of $ 
would be drastically wrong, and there would be no indication in the empirical 
variance that the true variance of the estimator $ is large. 




10000 100000 1000000 





Figure 29.5. Functions involved in 
importance sampling. We wish to 
estimate the expectation of <f>(x) 
under P(x) oc P*(x). We can 
generate samples from the simpler 
distribution Q(x) tx Q*(x). We 
can evaluate Q* and P* at any 
point. 



Figure 29.6. Importance sampling 
in action: (a) using a Gaussian 
sampler density; (b) using a 
Cauchy sampler density. Vertical 
axis shows the estimate $. The 
horizontal line indicates the true 
value of <&. Horizontal axis shows 
number of samples on a log scale. 



1000 10000 100000 1000000 



Cautionary illustration of importance sampling 

In a toy problem related to the modelling of amino acid probability distribu- 
tions with a one-dimensional variable x, I evaluated a quantity of interest us- 
ing importance sampling. The results using a Gaussian sampler and a Cauchy 
sampler are shown in figure 29.6. The horizontal axis shows the number of 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 



29.2: Importance sampling 



363 



samples on a log scale. In the case of the Gaussian sampler, after about 500 
samples had been evaluated one might be tempted to call a halt; but evidently 
there are infrequent samples that make a huge contribution to <1>, and the value 
of the estimate at 500 samples is wrong. Even after a million samples have 
been taken, the estimate has still not settled down close to the true value. In 
contrast, the Cauchy sampler does not suffer from glitches; it converges (on 
the scale shown here) after about 5000 samples. 

This example illustrates the fact that an importance sampler should have 
heavy tails. 

Exercise 29.2.^' P- 385 l Consider the situation where P*(x) is multimodal, con- 
sisting of several widely-separated peaks. (Probability distributions like 
this arise frequently in statistical data modelling.) Discuss whether it is 
a wise strategy to do importance sampling using a sampler Q(x) that 
is a unimodal distribution fitted to one of these peaks. Assume that 
the function <p{x) whose mean $ is to be estimated is a smoothly vary- 
ing function of x such as mx + c. Describe the typical evolution of the 
estimator $ as a function of the number of samples R. 
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Figure 29.7. A multimodal 
distribution P*(x) and a unimodal 
sampler Q(x). 



Importance sampling in many dimensions 

We have already observed that care is needed in one-dimensional importance 
sampling problems. Is importance sampling a useful technique in spaces of 
higher dimensionality, say N = 1000? 

Consider a simple case-study where the target density P(x) is a uniform 
distribution inside a sphere, 

" 1 < p(x) < R P 
p(x) > R P , 



P*(x) 



(29.23) 



where p(x) = i^ZiX 2 ) 1 / 2 , an d the proposal density is a Gaussian centred on 
the origin, 

Q(x) = Y[ Normal^; 0,ct 2 ). (29.24) 



An importance-sampling method will be in trouble if the estimator $ is dom- 
inated by a few large weights w r . What will be the typical range of values of 
the weights w r 7 We know from our discussions of typical sequences in Part I - 
see exercise 6.14 (p. 124), for example - that if p is the distance from the origin 
of a sample from Q, the quantity p 2 has a roughly Gaussian distribution with 
mean and standard deviation: 



N<r ± V2Na 2 



(29.25) 



Thus almost all samples from Q lie in a typical set with distance from the origin 
very close to \/N a. Let us assume that a is chosen such that the typical set 
of Q lies inside the sphere of radius Rp. [If it does not, then the law of large 
numbers implies that almost all the samples generated from Q will fall outside 
Rp and will have weight zero.] Then we know that most samples from Q will 
have a value of Q that lies in the range 



1 




(2^2)^/2 6XP \ 2 - 2 
Thus the weights w r = P* /Q will typically have values in the range 



(29.26) 



(29.27) 
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(a) 



X cQ*{x) 
/P*(x) \ 



00 




/' X cQ*(x) 




So if we draw a hundred samples, what will the typical range of weights be? 
We can roughly estimate the ratio of the largest weight to the median weight 
by doubling the standard deviation in equation (29.27). The largest weight 
and the median weight will typically be in the ratio: 



Figure 29.8. Rejection sampling, 
(a) The functions involved in 
rejection sampling. We desire 
samples from P(x) oc P*(x). We 
are able to draw samples from 
Q(x) oc Q*(x), and we know a 
value c such that cQ*(x) > P*(x) 
for all x. (b) A point (x, u) is 
generated at random in the lightly 
shaded area under the curve 
cQ*(x). If this point also lies 
below P*(x) then it is accepted. 



w" 



,,med 



exp 



(29.28) 



In TV = 1000 dimensions therefore, the largest weight after one hundred sam- 
ples is likely to be roughly 10 19 times greater than the median weight. Thus an 
importance sampling estimate for a high-dimensional problem will very likely 
be utterly dominated by a few samples with huge weights. 

In conclusion, importance sampling in high dimensions often suffers from 
two difficulties. First, we need to obtain samples that lie in the typical set of P, 
and this may take a long time unless Q is a good approximation to P. Second, 
even if we obtain samples in the typical set, the weights associated with those 
samples are likely to vary by large factors, because the probabilities of points 
in a typical set, although similar to each other, still differ by factors of order 
exp(v / TV), so the weights will too, unless Q is a near-perfect approximation to 
P. 



► 29.3 Rejection sampling 

We assume again a one-dimensional density P(x) = P*(x)/Z that is too com- 
plicated a function for us to be able to sample from it directly. We assume 
that we have a simpler proposal density Q(x) which we can evaluate (within a 
multiplicative factor Zq, as before), and from which we can generate samples. 
We further assume that we know the value of a constant c such that 

cQ*{x) > P*(x), for all x. (29.29) 

A schematic picture of the two functions is shown in figure 29.8a. 

We generate two random numbers. The first, x, is generated from the 
proposal density Q(x). We then evaluate cQ*(x) and generate a uniformly 
distributed random variable u from the interval [0, cQ*(x)). These two random 
numbers can be viewed as selecting a point in the two-dimensional plane as 
shown in figure 29.8b. 

We now evaluate P*(x) and accept or reject the sample x by comparing the 
value of u with the value of P*(x). If u > P*(x) then x is rejected; otherwise 
it is accepted, which means that we add x to our set of samples {x^}. The 
value of u is discarded. 

Why does this procedure generate samples from P(x)l The proposed point 
(x, u) comes with uniform probability from the lightly shaded area underneath 
the curve cQ*(x) as shown in figure 29.8b. The rejection rule rejects all the 
points that lie above the curve P*(x). So the points (x, u) that are accepted 
are uniformly distributed in the heavily shaded area under P*(x). This implies 
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that the probability density of the x-coordinates of the accepted points must 
be proportional to P*(x), so the samples must be independent samples from 
P(x). 

Rejection sampling will work best if Q is a good approximation to P. If Q 
is very different from P then, for c Q to exceed P everywhere, c will necessarily 
have to be large and the frequency of rejection will be large. 

Rejection sampling in many dimensions 

In a high-dimensional problem it is very likely that the requirement that cQ* 
be an upper bound for P* will force c to be so huge that acceptances will be 
very rare indeed. Finding such a value of c may be difficult too, since in many 
problems we know neither where the modes of P* are located nor how high 
they are. 

As a case study, consider a pair of ^-dimensional Gaussian distributions 
with mean zero (figure 29.9). Imagine generating samples from one with stan- 
dard deviation oq and using rejection sampling to obtain samples from the 
other whose standard deviation is up. Let us assume that these two standard 
deviations are close in value - say, oq is 1% larger than Op. [<Jq must be larger 
than ap because if this is not the case, there is no c such that cQ exceeds P 
for all x.] So, what value of c is required if the dimensionality is N = 1000? 
The density of Q(x) at the origin is 1/ (2ikJq) n / 2 , so for cQ to exceed P we 
need to set 



c = 



(2^)^/2 



exp 



TV In 



Op 



(29.30) 



With N = 1000 and ^ = 1.01, we find c = exp(10) ~ 20,000. What will the 
acceptance rate be for this value of c? The answer is immediate: since the 
acceptance rate is the ratio of the volume under the curve -P(x) to the volume 
under cQ(x), the fact that P and Q are both normalized here implies that 
the acceptance rate will be 1/c, for example, 1/20,000. In general, c grows 
exponentially with the dimensionality N, so the acceptance rate is expected 
to be exponentially small in N. 

Rejection sampling, therefore, whilst a useful method for one-dimensional 
problems, is not expected to be a practical technique for generating samples 
from high-dimensional distributions -P(x). 



► 29.4 The Metropolis-Hastings method 

Importance sampling and rejection sampling work well only if the proposal 
density Q(x) is similar to P(x). In large and complex problems it is difficult 
to create a single density Q(x) that has this property. 

The Metropolis-Hastings algorithm instead makes use of a proposal den- 
sity Q which depends on the current state x^K The density Q(x'; x^>) might 
be a simple distribution such as a Gaussian centred on the current x^ . The 
proposal density Q(x';x) can be any fixed density from which we can draw 
samples. In contrast to importance sampling and rejection sampling, it is not 
necessary that Q(x'; x®) look at all similar to P(x) in order for the algorithm 
to be practically useful. An example of a proposal density is shown in fig- 
ure 29.10; this figure shows the density Q(x';x^') for two different states a^ 1 ' 
and x^. 

As before, we assume that we can evaluate P*(x) for any x. A tentative 
new state x' is generated from the proposal density Q(x';x^'). To decide 









\ cQ(x) - 







Figure 29.9. A Gaussian P(x) and 
a slightly broader Gaussian Q(x) 
scaled up by a factor c such that 
cQ(x) > P(x). 




,(D 




x (2) 

Figure 29.10. Metropolis-Hastings 
method in one dimension. The 
proposal distribution Q(x'; x) is 
here shown as having a shape that 
changes as x changes, though this 
is not typical of the proposal 
densities used in practice. 
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whether to accept the new state, we compute the quantity 

PV)Q(^W) 

P\xW)Q{x';xWy ( ' 

If a > 1 then the new state is accepted. 

Otherwise, the new state is accepted with probability a. 

If the step is accepted, we set = x'. 

If the step is rejected, then we set a;(* +1 ) = x^\ 

Note the difference from rejection sampling: in rejection sampling, rejected 
points are discarded and have no influence on the list of samples {x^} that 
we collected. Here, a rejection causes the current state to be written again 
onto the list. 

Notation. I have used the superscript r = 1, . . . ,R to label points that 
are independent samples from a distribution, and the superscript t = 1, . . . , T 
to label the sequence of states in a Markov chain. It is important to note that 
a Metropolis-Hastings simulation of T iterations does not produce T indepen- 
dent samples from the target distribution P. The samples are dependent. 

To compute the acceptance probability (29.31) we need to be able to com- 
pute the probability ratios P(x') / 'P(x®) and Q(x^;x')/Q(x';x^). If the 
proposal density is a simple symmetrical density such as a Gaussian centred on 
the current point, then the latter factor is unity, and the Metropolis-Hastings 
method simply involves comparing the value of the target density at the two 
points. This special case is sometimes called the Metropolis method. How- 
ever, with apologies to Hastings, I will call the general Metropolis-Hastings 
algorithm for asymmetric Q 'the Metropolis method' since I believe important 
ideas deserve short names. 



Convergence of the Metropolis method to the target density 

It can be shown that for any positive Q (that is, any Q such that Q(x';x) > 
for all x,x'), as t — > oo, the probability distribution of x^> tends to P(x) = 
P*(x)/Z. [This statement should not be seen as implying that Q has to assign 
positive probability to every point x' - we will discuss examples later where 
Q(x';x) = for some x,x'; notice also that we have said nothing about how 
rapidly the convergence to P(x) takes place.] 

The Metropolis method is an example of a Markov chain Monte Carlo 
method (abbreviated MCMC). In contrast to rejection sampling, where the 
accepted points {x^} are independent samples from the desired distribution, 
Markov chain Monte Carlo methods involve a Markov process in which a se- 
quence of states {x^} is generated, each sample x^> having a probability 
distribution that depends on the previous value, x^ 1 ^. Since successive sam- 
ples are dependent, the Markov chain may have to be run for a considerable 
time in order to generate samples that are effectively independent samples 
from P. 

Just as it was difficult to estimate the variance of an importance sampling 
estimator, so it is difficult to assess whether a Markov chain Monte Carlo 
method has 'converged', and to quantify how long one has to wait to obtain 
samples that are effectively independent samples from P. 



Demonstration of the Metropolis method 

The Metropolis method is widely used for high-dimensional problems. Many 
implementations of the Metropolis method employ a proposal distribution 
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Figure 29.11. Metropolis method 
in two dimensions, showing a 
traditional proposal density that 
has a sufficiently small step size e 
that the acceptance frequency will 
be about 0.5. 



with a length scale e that is short relative to the longest length scale L of the 
probable region (figure 29.11). A reason for choosing a small length scale is 
that for most high-dimensional problems, a large random step from a typical 
point (that is, a sample from P(x)) is very likely to end in a state that has 
very low probability; such steps are unlikely to be accepted. If e is large, 
movement around the state space will only occur when such a transition to a 
low-probability state is actually accepted, or when a large random step chances 
to land in another probable state. So the rate of progress will be slow if large 
steps are used. 

The disadvantage of small steps, on the other hand, is that the Metropolis 
method will explore the probability distribution by a random walk, and a 
random walk takes a long time to get anywhere, especially if the walk is made 
of small steps. 

Exercise 29.3. ^ Consider a one-dimensional random walk, on each step of 
which the state moves randomly to the left or to the right with equal 
probability. Show that after T steps of size e, the state is likely to have 
moved only a distance about VTe. (Compute the root mean square 
distance travelled.) 

Recall that the first aim of Monte Carlo sampling is to generate a number of 
independent samples from the given distribution (a dozen, say). If the largest 
length scale of the state space is L, then we have to simulate a random-walk 
Metropolis method for a time T ~ {L/e) 2 before we can expect to get a sample 
that is roughly independent of the initial condition - and that's assuming that 
every step is accepted: if only a fraction / of the steps are accepted on average, 
then this time is increased by a factor 1/f. 



Rule of thumb: lower bound on number of iterations of a 
Metropolis method. If the largest length scale of the space of 
probable states is L, a Metropolis method whose proposal distribu- 
tion generates a random walk with step size e must be run for at 
least 

T ~ (L/e) 2 (29.32) 
iterations to obtain an independent sample. 



This rule of thumb gives only a lower bound; the situation may be much 
worse, if, for example, the probability distribution consists of several islands 
of high probability separated by regions of low probability. 
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(b) Metropolis 



(c) Independent sampling 
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Figure 29.12. Metropolis method 
for a toy problem, (a) The state 
sequence for t = 1, . . . , 600. 
Horizontal direction = states from 
to 20; vertical direction = time 
from 1 to 600; the cross bars mark 
time intervals of duration 50. (b) 
Histogram of occupancy of the 
states after 100, 400 and 1200 
iterations, (c) For comparison, 
histograms resulting when 
successive points are drawn 
independently from the target 
distribution. 
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To illustrate how slowly a random walk explores a state space, figure 29.12 
shows a simulation of a Metropolis algorithm for generating samples from the 
distribution: 

P{x) = { V21 xe {0,1, 2,..., 20} 
{0 otherwise. 

The proposal distribution is 

Q^) = {t oter^e 1 ^ 

Because the target distribution P{x) is uniform, rejections occur only when 
the proposal takes the state to x' = —1 or x' = 21. 

The simulation was started in the state xq = 10 and its evolution is shown 
in figure 29.12a. How long does it take to reach one of the end states x = 
and x = 20? Since the distance is 10 steps, the rule of thumb (29.32) predicts 
that it will typically take a time T ~ 100 iterations to reach an end state. This 
is confirmed in the present example: the first step into an end state occurs on 
the 178th iteration. How long does it take to visit both end states? The rule 
of thumb predicts about 400 iterations are required to traverse the whole state 
space; and indeed the first encounter with the other end state takes place on 
the 540th iteration. Thus effectively-independent samples are only generated 
by simulating for about four hundred iterations per independent sample. 

This simple example shows that it is important to try to abolish random 
walk behaviour in Monte Carlo methods. A systematic exploration of the toy 
state space {0, 1, 2, ... , 20} could get around it, using the same step sizes, in 
about twenty steps instead of four hundred. Methods for reducing random 
walk behaviour are discussed in the next chapter. 

Metropolis method in high dimensions 

The rule of thumb (29.32), which gives a lower bound on the number of itera- 
tions of a random walk Metropolis method, also applies to higher dimensional 
problems. Consider the simple case of a target distribution that is an N- 
dimensional Gaussian, and a proposal distribution that is a spherical Gaussian 
of standard deviation e in each direction. Without loss of generality, we can 
assume that the target distribution is a separable distribution aligned with the 
axes {x n }, and that it has standard deviation o n in direction n. Let ,r max and 
0-min k e the largest and smallest of these standard deviations. Let us assume 
that e is adjusted such that the acceptance frequency is close to 1. Under this 
assumption, each variable x n evolves independently of all the others, executing 
a random walk with step size about e. The time taken to generate effectively 
independent samples from the target distribution will be controlled by the 
largest lengthscale <r max . Just as in the previous section, where we needed at 
least T ~ (L/e) 2 iterations to obtain an independent sample, here we need 
T~ (ff max /e) 2 . 

Now, how big can e be? The bigger it is, the smaller this number T be- 
comes, but if e is too big - bigger than cr mln - then the acceptance rate will 
fall sharply. It seems plausible that the optimal e must be similar to cr mm . 
Strictly, this may not be true; in special cases where the second smallest a n 
is significantly greater than cr mm , the optimal e may be closer to that second 
smallest a n . But our rough conclusion is this: where simple spherical pro- 
posal distributions are used, we will need at least T ~ (<7 max /er mln ) 2 iterations 
to obtain an independent sample, where ,r max and er mln are the longest and 
shortest lengthscales of the target distribution. 
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Figure 29.13. Gibbs sampling, 
(a) The joint density P(x) from 
which samples are required, (b) 
Starting from a state xW , x\ is 
sampled from the conditional 
density P{x\ \ x^). (c) A sample 
is then made from the conditional 
density P(x 2 \xi)- (d) A couple of 
iterations of Gibbs sampling. 



This is good news and bad news. It is good news because, unlike the 
cases of rejection sampling and importance sampling, there is no catastrophic 
dependence on the dimensionality N. Our computer will give useful answers 
in a time shorter than the age of the universe. But it is bad news all the same, 
because this quadratic dependence on the lengthscale-ratio may still force us 
to make very lengthy simulations. 

Fortunately, there are methods for suppressing random walks in Monte 
Carlo simulations, which we will discuss in the next chapter. 

► 29.5 Gibbs sampling 

We introduced importance sampling, rejection sampling and the Metropolis 
method using one-dimensional examples. Gibbs sampling, also known as the 
heat bath method or 'Glauber dynamics', is a method for sampling from dis- 
tributions over at least two dimensions. Gibbs sampling can be viewed as a 
Metropolis method in which a sequence of proposal distributions Q are defined 
in terms of the conditional distributions of the joint distribution -P(x). It is 
assumed that, whilst -P(x) is too complex to draw samples from directly, its 
conditional distributions P(xi \ {xj}j^i) are tractable to work with. For many 
graphical models (but not all) these one-dimensional conditional distributions 
are straightforward to sample from. For example, if a Gaussian distribution 
for some variables d has an unknown mean m, and the prior distribution of m 
is Gaussian, then the conditional distribution of m given d is also Gaussian. 
Conditional distributions that are not of standard form may still be sampled 
from by adaptive rejection sampling if the conditional distribution satisfies 
certain convexity properties (Gilks and Wild, 1992). 

Gibbs sampling is illustrated for a case with two variables (x 1,0:2) = x 
in figure 29.13. On each iteration, we start from the current state x^, and 
xi is sampled from the conditional density P(x\ \ x2), with X2 fixed to ■ 
A sample Xi is then made from the conditional density P(x2 \xi), using the 
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29.5: Gihhs sampling 

new value of x\. This brings us to the new state x'* +1 ', and completes the 
iteration. 

In the general case of a system with K variables, a single iteration involves 
sampling one parameter at a time: 

xf +l) ~ P{x 1 \x [ 2 t) ,xf,...,x { ^) (29.35) 
~ P{x 2 \4 +l \x { f\...,xf) (29.36) 
4 +1) ~ P(x3|*r i) ,^ 1) ,-.4 ) ). etc. (29.37) 

Convergence of Gibbs sampling to the target density 

> Exercise 29.4. ^ Show that a single variable-update of Gibbs sampling can 

be viewed as a Metropolis method with target density -P(x), and that 
this Metropolis method has the property that every proposal is always 
accepted. 

Because Gibbs sampling is a Metropolis method, the probability distribution 
of x'*' tends to -P(x) as t — > oo, as long as P(x) does not have pathological 
properties. 

> Exercise 29.5.^' p ' 385 l Discuss whether the syndrome decoding problem for a 

(7, 4) Hamming code can be solved using Gibbs sampling. The syndrome 
decoding problem, if we are to solve it with a Monte Carlo approach, 
is to draw samples from the posterior distribution of the noise vector 
n = (m, ... ,n n ,-.. ,n N ), 

1 N 

P(n | f , z) = - n - fj 1 -^ l[Hn = z], (29.38) 

71=1 

where /„ is the normalized likelihood for the nth transmitted bit and z 
is the observed syndrome. The factor 1 [Hn = z] is 1 if n has the correct 
syndrome z and otherwise. 

What about the syndrome decoding problem for any linear error-correcting 
code? 



Gibbs sampling in high dimensions 

Gibbs sampling suffers from the same defect as simple Metropolis algorithms 
- the state space is explored by a slow random walk, unless a fortuitous pa- 
rameterization has been chosen that makes the probability distribution -P(x) 
separable. If, say, two variables x\ and x-i are strongly correlated, having 
marginal densities of width L and conditional densities of width e, then it will 
take at least about {L/e) 2 iterations to generate an independent sample from 
the target density. Figure 30.3, p. 390, illustrates the slow progress made by 
Gibbs sampling when L»f. 

However Gibbs sampling involves no adjustable parameters, so it is an at- 
tractive strategy when one wants to get a model running quickly. An excellent 
software package, BUGS, makes it easy to set up almost arbitrary probabilistic 
models and simulate them by Gibbs sampling (Thomas et ai, 1992). 1 



1 http : //www. mrc-bsu. cam. ac.uk/bugs/ 
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► 29.6 Terminology for Markov chain Monte Carlo methods 



We now spend a few moments sketching the theory on which the Metropolis 
method and Gibbs sampling are based. We denote by p^(x) the probabil- 
ity distribution of the state of a Markov chain simulator. (To visualize this 
distribution, imagine running an infinite collection of identical simulators in 
parallel.) Our aim is to find a Markov chain such that as t — > oo, p' ( '(x) tends 
to the desired distribution -P(x). 

A Markov chain can be specified by an initial probability distribution 
p(°)(x) and a transition probability T(x';x). 

The probability distribution of the state at the (t+l)th iteration of the 
Markov chain, p(' +1 )(x), is given by 



P 



(*+i) 



V) = yd w x T(x';x)pW(x). 



(29.39) 



Example 29.6. An example of a Markov chain is given by the Metropolis 
demonstration of section 29.4 (figure 29.12), for which the transition proba- 
bility is 

}/?V2. 



Z/ 2 /2 i/ ■ ■ 

V2 • V2 

■ 12 M 

■ ■ V2 



, V2. ■ ■ 
2 ■ 72 •, ■ 

■ 72 ■ 72. 



1/2 



V2 



• V2 

2-72 

1/2.1/2. .... 

■ /2 1/ /2 1/ ■ ■ ■ 
. . 1/2 1/2 • ■ ■ 

■ ■ ■ /2 1/ /2 1/ ■ 
.... 1/2 . 1/2 

V2 1/2 

V2 1/2 

::::::: h vM 



and the initial distribution was 
P W (x) =[ 



(29.40) 



The probability distribution p« (x) of the state at the tth iteration is shown 
for t = 0, 1, 2, 3, 5, 10, 100, 200, 400 in figure 29.14; an equivalent sequence of 
distributions is shown in figure 29.15 for the chain that begins in initial state 
xq = 17. Both chains converge to the target density, the uniform density, as 
t — > oo. 



Required properties 

When designing a Markov chain Monte Carlo method, we construct a chain 
with the following properties: 

1. The desired distribution P(x) is an invariant distribution of the chain. 

A distribution 7r(x) is an invariant distribution of the transition proba- 
bility T(x';x) if 

7r(x') = y'd Ar x T(x / ;x) 7 r(x). (29.41) 

An invariant distribution is an eigenvector of the transition probability 
matrix that has eigenvalue 1. 
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p {a \x 



5 10 15 20 




5 10 15 20 



p (*) aAAAa 
, — v , v vv — , 

5 10 15 20 

p (100) (-) -vWWWv- 

5 10 15 20 

5 10 15 20 

p< 400 >(z) - 

5 10 15 20 

Figure 29.14. The probability 
distribution of the state of the 
Markov chain of example 29.6. 
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2. The chain must also be ergodic, that is, 



-,(*) 



(x) — > 7r(x) as t — > oo, for any p(°'(x) 



(29.42) 



A couple of reasons why a chain might not be ergodic are: 



(a) Its matrix might be reducible, which means that the state space 
contains two or more subsets of states that can never be reached 
from each other. Such a chain has many invariant distributions; 
which one p(*'(x) would tend to as t — > oo would depend on the 
initial condition p(°'(x). 

The transition probability matrix of such a chain has more than 
one eigenvalue equal to 1. 

(b) The chain might have a periodic set, which means that, for some 
initial conditions, ])W(x) doesn't tend to an invariant distribution, 
but instead tends to a periodic limit-cycle. 

A simple Markov chain with this property is the random walk on the 
Af-dimensional hypercube. The chain T takes the state from one 
corner to a randomly chosen adjacent corner. The unique invariant 
distribution of this chain is the uniform distribution over all 2 N 
states, but the chain is not ergodic; it is periodic with period two: 
if we divide the states into states with odd parity and states with 
even parity, we notice that every odd state is surrounded by even 
states and vice versa. So if the initial condition at time t = is a 
state with even parity, then at time t = 1 - and at all odd times 
- the state must have odd parity, and at all even times, the state 
will be of even parity. 

The transition probability matrix of such a chain has more than 
one eigenvalue with magnitude equal to 1. The random walk on 
the hypercube, for example, has eigenvalues equal to +1 and —1. 



p(°'(x) 



p (1 \x) 



p^(x) 



p^(x) 



„(ioo) 



„(200) 



„(«>()) 



(x) 



(x) 



(x) 



Figure 29.15. The probability 
distribution of the state of the 
Markov chain for initial condition 
x = 17 (example 29.6 (p.372)). 



Methods of construction of Markov chains 

It is often convenient to construct T by mixing or concatenating simple base 
transitions B all of which satisfy 



(29.43) 



for the desired density P(x), i.e., they all have the desired density as an 
invariant distribution. These base transitions need not individually be ergodic. 

T is a mixture of several base transitions £>{,(x', x) if we make the transition 
by picking one of the base transitions at random, and allowing it to determine 
the transition, i.e., 

T(x / ,x) = ^ Pb B b (x / ,x), (29.44) 



where {pt} is a probability distribution over the base transitions. 

T is a concatenation of two base transitions £?i(x',x) and Z?2(x',x) if we 
first make a transition to an intermediate state x" using B\, and then make a 
transition from state x" to x' using £>2- 

T(x',x) = J d^x" B 2 (x',x")5i(x",x). (29.45) 
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Detailed balance 

Many useful transition probabilities satisfy the detailed balance property: 

T(x a ; x b )P(x 6 ) = T(x 6 ; x a )P(x a ), for all x 6 and x Q . (29.46) 

This equation says that if we pick (by magic) a state from the target density 
P and make a transition under T to another state, it is just as likely that we 
will pick x/, and go from Xf, to x n as it is that we will pick x a and go from x 
to X(,. Markov chains that satisfy detailed balance are also called reversible 
Markov chains. The reason why the detailed balance property is of interest 
is that detailed balance implies invariance of the distribution P(x) under the 
Markov chain T, which is a necessary condition for the key property that we 
want from our MCMC simulation - that the probability distribution of the 
chain should converge to -P(x). 

> Exercise 29.7. ^ Prove that detailed balance implies invariance of the distri- 
bution P(x) under the Markov chain T. 

Proving that detailed balance holds is often a key step when proving that a 
Markov chain Monte Carlo simulation will converge to the desired distribu- 
tion. The Metropolis method satisfies detailed balance, for example. Detailed 
balance is not an essential condition, however, and we will see later that ir- 
reversible Markov chains can be useful in practice, because they may have 
different random walk properties. 

\> Exercise 29.8. ^ Show that, if we concatenate two base transitions B\ and E>2 
that satisfy detailed balance, it is not necessarily the case that the T 
thus defined (29.45) satisfies detailed balance. 

Exercise 29.9.^ Does Gibbs sampling, with several variables all updated in a 
deterministic sequence, satisfy detailed balance? 



► 29.7 Slice sampling 

Slice sampling (Neal, 1997a; Neal, 2003) is a Markov chain Monte Carlo 
method that has similarities to rejection sampling, Gibbs sampling and the 
Metropolis method. It can be applied wherever the Metropolis method can 
be applied, that is, to any system for which the target density P*(x) can be 
evaluated at any point x; it has the advantage over simple Metropolis methods 
that it is more robust to the choice of parameters like step sizes. The sim- 
plest version of slice sampling is similar to Gibbs sampling in that it consists of 
one-dimensional transitions in the state space; however there is no requirement 
that the one-dimensional conditional distributions be easy to sample from, nor 
that they have any convexity properties such as are required for adaptive re- 
jection sampling. And slice sampling is similar to rejection sampling in that 
it is a method that asymptotically draws samples from the volume under the 
curve described by P*(x); but there is no requirement for an upper-bounding 
function. 

I will describe slice sampling by giving a sketch of a one-dimensional sam- 
pling algorithm, then giving a pictorial description that includes the details 
that make the method valid. 
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29. 7: Slice sampling 

The skeleton of slice sampling 

Let us assume that we want to draw samples from P{x) oc P*{x) where x 
is a real number. A one-dimensional slice sampling algorithm is a method 
for making transitions from a two-dimensional point (x, u) lying under the 
curve P*(x) to another point (x',u') lying under the same curve, such that 
the probability distribution of (x, u) tends to a uniform distribution over the 
area under the curve P*{x), whatever initial point we start from - like the 
uniform distribution under the curve P*{x) produced by rejection sampling 
(section 29.3). 

A single transition (x,u) — > (x',u') of a one-dimensional slice sampling 
algorithm has the following steps, of which steps 3 and 8 will require further 
elaboration. 



1: 


evaluate P*(x) 


2: 


draw a vertical coordinate u' ~ Uniform(0, P*(x)) 


3: 


create a horizontal interval (xi,x r ) enclosing x 


4: 


loop { 


5: 


draw x' ~ Uniform^;, x r ) 


6: 


evaluate P\x') 


7: 


if P*(x') > u' break out of loop 4-9 


8: 


else modify the interval (x[,x r ) 


9: 


} 



There are several methods for creating the interval (x;,x r ) in step 3, and 
several methods for modifying it at step 8. The important point is that the 
overall method must satisfy detailed balance, so that the uniform distribution 
for (x, u) under the curve P*(x) is invariant. 



The 'stepping out ' method for step 3 

In the 'stepping out' method for creating an interval (xi,x r ) enclosing x, we 
step out in steps of length w until we find endpoints xi and x r at which P* is 
smaller than u. The algorithm is shown in figure 29.16. 

3a: draw r ~ Uniform(0, 1) 

3b : xi : = x — rw 

3c : x r := x + (1 — r)w 

3d: while {P*{xi) > v!) { x t := x x - w } 

3e : while (P*(x r ) > v!) { x r := x r + w } 



The 'shrinking ' method for step 8 

Whenever a point x' is drawn such that (x 1 ,v!) lies above the curve P*{x), 
we shrink the interval so that one of the end points is x' , and such that the 
original point x is still enclosed in the interval. 

8a: if [xl > x) { x r := x' } 
8b : else { X[ : = x' } 



Properties of slice sampling 

Like a standard Metropolis method, slice sampling gets around by a random 
walk, but whereas in the Metropolis method, the choice of the step size is 
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Figure 29.16. Slice sampling. Each 
panel is labelled by the steps of 
the algorithm that are executed in 
it. At step 1, P*(x) is evaluated 
at the current point x. At step 2, 
a vertical coordinate is selected 
giving the point (x,u') shown by 
the box; At steps 3a- c, an 
interval of size w containing 
(x, v!) is created at random. At 
step 3d, P* is evaluated at the left 
end of the interval and is found to 
be larger than v! , so a step to the 
left of size w is made. At step 3e, 
P* is evaluated at the right end of 
the interval and is found to be 
smaller than it', so no stepping 
out to the right is needed. When 
step 3d is repeated, P* is found to 
be smaller than it', so the 
stepping out halts. At step 5 a 
point is drawn from the interval, 
shown by a o. Step 6 establishes 
that this point is above P* and 
step 8 shrinks the interval to the 
rejected point in such a way that 
the original point x is still in the 
interval. When step 5 is repeated, 
the new coordinate x' (which is to 
the right-hand side of the 
interval) gives a value of P* 
greater than u' , so this point x' is 
the outcome at step 7. 
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critical to the rate of progress, in slice sampling the step size is self-tuning. If 

the initial interval size w is too small by a factor / compared with the width of 

the probable region then the stepping-out procedure expands the interval size. 

The cost of this stepping-out is only linear in /, whereas in the Metropolis 1Q 

method the computer-time scales as the square of / if the step size is too 

small. _ 

If the chosen value of w is too large by a factor F then the algorithm 
spends a time proportional to the logarithm of F shrinking the interval down _ 
to the right size, since the interval typically shrinks by a factor in the ballpark 
of 0.6 each time a point is rejected. In contrast, the Metropolis algorithm , _ 

responds to a too-large step size by rejecting almost all proposals, so the rate q— \ \ \ j \ 

of progress is exponentially bad in F. There are no rejections in slice sampling. 

The probability of staying in exactly the same place is very small. Figure 29.17. P*(x). 

t> Exercise 29.10. ^ Investigate the properties of slice sampling applied to the 
density shown in figure 29.17. x is a real variable between 0.0 and 11.0. 
How long does it take typically for slice sampling to get from an x in 
the peak region x £ (0, 1) to an x in the tail region x £ (1, 11), and vice 
versa! Confirm that the probabilities of these transitions do yield an 
asymptotic probability density that is correct. 

How slice sampling is used in real problems 

An iV-dimensional density P(x) oc P*(x) may be sampled with the help of the 
one-dimensional slice sampling method presented above by picking a sequence 
of directions y^\y( 2 \ . . . and defining x = + xy®. The function P*(x) 
above is replaced by P*(x) = P*(x'*' + xy^). The directions may be chosen 
in various ways; for example, as in Gibbs sampling, the directions could be the 
coordinate axes; alternatively, the directions y W may be selected at random 
in any manner such that the overall procedure satisfies detailed balance. 

Computer-friendly slice sampling 

The real variables of a probabilistic model will always be represented in a 
computer using a finite number of bits. In the following implementation of 
slice sampling due to Skilling, the stepping-out, randomization, and shrinking 
operations, described above in terms of floating-point operations, are replaced 
by binary and integer operations. 

We assume that the variable x that is being slice-sampled is represented by 
a 6-bit integer X taking on one of B = 2 b values, 0, 1, 2, ... , B—l, many or all 
of which correspond to valid values of x. Using an integer grid eliminates any 
errors in detailed balance that might ensue from variable-precision rounding of 
floating-point numbers. The mapping from X to x need not be linear; if it is 
nonlinear, we assume that the function P*(x) is replaced by an appropriately 
transformed function - for example, P**(X) cx P*(x)\dx/dX\. 

We assume the following operators on 6-bit integers are available: 



N := randbits(Z) 



X + N 
X - N 

X®N 



arithmetic sum, modulo B, of X and N. 
difference, modulo B, of X and N. 
bitwise exclusive-or of X and N. 
sets N to a random /-bit integer. 



A slice-sampling procedure for integers is then as follows: 
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Given: a current point X and a 


height Y = P*(X) x Uniform(0, 1) < P*{X) 


1: 


U := randbits(6) 


Define a random translation U of the binary coor- 






dinate system. 


2: 


set I to a value I < b 


Set initial /-bit sampling range. 


3: 


do { 




4: 


N := randbits(l) 


Define a random move within the current interval of 






width 2 l . 


5: 


X'-=((X -U)®N) + U 


Randomize the lowest I bits of X (in the translated 






coordinate system). 


6: 


l:=l-l 


If X' is not acceptable, decrease I and try again 


7: 


} until {X 1 = X) or (P*{X') > Y) 


with a smaller perturbation of X; termination at or 






before I = is assured. 



The translation U is introduced to avoid permanent sharp edges, where 
for example the adjacent binary integers 0111111111 and 1000000000 would 
otherwise be permanently in different sectors, making it difficult for X to move 
from one to the other. 

The sequence of intervals from which the new candidate points are drawn 
is illustrated in figure 29.18. First, a point is drawn from the entire interval, 
shown by the top horizontal line. At each subsequent draw, the interval is 
halved in such a way as to contain the previous point X. 

If preliminary stepping-out from the initial range is required, step 2 above 
can be replaced by the following similar procedure: 



2a 


set I to a value I < b 


I sets the initial width 


2b 


do { 




2c 


iV:=randbits(7) 




2d 


X':=((X -U)®N) + U 




2e 


l:=l+l 




2f 


} until (I = b) or (P*(X') < Y) 





These shrinking and stepping out methods shrink and expand by a factor 
of two per evaluation. A variant is to shrink or expand by more than one bit 
each time, setting I := I ± Al with Al > 1. Taking Al at each step from any 
pre-assigned distribution (which may include Al = 0) allows extra flexibility. 

Exercise 29.11.^' In the shrinking phase, after an unacceptable X' has been 
produced, the choice of Al is allowed to depend on the difference between 
the slice's height Y and the value of P*(X'), without spoiling the algo- 
rithm's validity. (Prove this.) It might be a good idea to choose a larger 
value of Al when Y — P*(X') is large. Investigate this idea theoretically 
or empirically. 

A feature of using the integer representation is that, with a suitably ex- 
tended number of bits, the single integer X can represent two or more real 
parameters - for example, by mapping X to {x\, X2, £3) through a space-filling 
curve such as a Peano curve. Thus multi-dimensional slice sampling can be 
performed using the same software as for one dimension. 







B-l 



Figure 29.18. The sequence of 
intervals from which the new 
candidate points are drawn. 
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29.8: Practicalities 

► 29.8 Practicalities 

Can we predict how long a Markov chain Monte Carlo simulation 
will take to equilibrate? By considering the random walks involved in a 
Markov chain Monte Carlo simulation we can obtain simple lower bounds on 
the time required for convergence. But predicting this time more precisely is a 
difficult problem, and most of the theoretical results giving upper bounds on 
the convergence time are of little practical use. The exact sampling methods 
of Chapter 32 offer a solution to this problem for certain Markov chains. 

Can we diagnose or detect convergence in a running simulation? 

This is also a difficult problem. There are a few practical tools available, but 
none of them is perfect (Cowles and Carlin, 1996). 

Can we speed up the convergence time and time between indepen- 
dent samples of a Markov chain Monte Carlo method? Here, there is 
good news, as described in the next chapter, which describes the Hamiltonian 
Monte Carlo method, overrelaxation, and simulated annealing. 

► 29.9 Further practical issues 

Can the normalizing constant be evaluated? 

If the target density P(x) is given in the form of an unnormalized density 
P*(x) with P(x) = ^P*(x), the value of Z may well be of interest. Monte 
Carlo methods do not readily yield an estimate of this quantity, and it is an 
area of active research to find ways of evaluating it. Techniques for evaluating 
Z include: 

1. Importance sampling (reviewed by Neal (1993b)) and annealed impor- 
tance sampling (Neal, 1998). 

2. 'Thermodynamic integration' during simulated annealing, the 'accep- 
tance ratio' method, and 'umbrella sampling' (reviewed by Neal (1993b)). 

3. 'Reversible jump Markov chain Monte Carlo' (Green, 1995). 

One way of dealing with Z, however, may be to find a solution to one's 
task that does not require that Z be evaluated. In Bayesian data modelling 
one might be able to avoid the need to evaluate Z - which would be important 
for model comparison - by not having more than one model. Instead of using 
several models (differing in complexity, for example) and evaluating their rel- 
ative posterior probabilities, one can make a single hierarchical model having, 
for example, various continuous hyperparameters which play a role similar to 
that played by the distinct models (Neal, 1996). In noting the possibility of 
not computing Z , I am not endorsing this approach. The normalizing constant 
Z is often the single most important number in the problem, and I think every 
effort should be devoted to calculating it. 

The Metropolis method for big models 

Our original description of the Metropolis method involved a joint updating 
of all the variables using a proposal density Q(x';x). For big problems it 
may be more efficient to use several proposal distributions Q^(x'; x), each of 
which updates only some of the components of x. Each proposal is individually 
accepted or rejected, and the proposal distributions are repeatedly run through 
in sequence. 
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> Exercise 29.12.^' P- 385 ] Explain why the rate of movement through the state 
space will be greater when B proposals . . . , are considered 

individually in sequence, compared with the case of a single proposal 
Q* defined by the concatenation of . . . , (J^ 5 - 1 . Assume that each 
proposal distribution QW(x';x) has an acceptance rate / < 1/2. 

In the Metropolis method, the proposal density Q(x';x) typically has a 
number of parameters that control, for example, its 'width'. These parameters 
are usually set by trial and error with the rule of thumb being to aim for a 
rejection frequency of about 0.5. It is not valid to have the width parameters 
be dynamically updated during the simulation in a way that depends on the 
history of the simulation. Such a modification of the proposal density would 
violate the detailed balance condition that guarantees that the Markov chain 
has the correct invariant distribution. 

Gibbs sampling in big models 

Our description of Gibbs sampling involved sampling one parameter at a time, 
as described in equations (29.35-29.37). For big problems it may be more 
efficient to sample groups of variables jointly, that is to use several proposal 
distributions: 

xf + ]\ . . , 4*+!) ~ P( Xl , . . . , x a | a; W i ) . . . > x W ) (29.47) 

x {t+l) x [t+1) ~ P(x j-i xu\x (t+l) x^+V x {t) x {t) ) etc 

How many samples are needed? 

At the start of this chapter, we observed that the variance of an estimator <I> 
depends only on the number of independent samples R and the value of 

a 2 = J d w x P(x)(<£(x) - $) 2 . (29.48) 

We have now discussed a variety of methods for generating samples from P(x). 
How many independent samples R should we aim for? 

In many problems, we really only need about twelve independent samples 
from -P(x). Imagine that x is an unknown vector such as the amount of 
corrosion present in each of 10 000 underground pipelines around Cambridge, 
and 0(x) is the total cost of repairing those pipelines. The distribution -P(x) 
describes the probability of a state x given the tests that have been carried out 
on some pipelines and the assumptions about the physics of corrosion. The 
quantity $ is the expected cost of the repairs. The quantity a 2 is the variance 
of the cost - a measures by how much we should expect the actual cost to 
differ from the expectation $. 

Now, how accurately would a manager like to know $? I would suggest 
there is little point in knowing $ to a precision finer than about a/3. After 
all, the true cost is likely to differ by ±cr from <!>. If we obtain R = 12 
independent samples from -P(x), we can estimate $ to a precision of a / \/\2 - 
which is smaller than cr/3. So twelve samples suffice. 

Allocation of resources 

Assuming we have decided how many independent samples R are required, 
an important question is how one should make use of one's limited computer 
resources to obtain these samples. 
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(1) - 

(2) - 

0): 



A typical Markov chain Monte Carlo experiment involves an initial pe- 
riod in which control parameters of the simulation such as step sizes may be 
adjusted. This is followed by a 'burn in' period during which we hope the 
simulation 'converges' to the desired distribution. Finally, as the simulation 
continues, we record the state vector occasionally so as to create a list of states 
{ x ^}^Li that we hope are roughly independent samples from P(x). 

There are several possible strategies (figure 29.19): 



Figure 29.19. Three possible 
Markov chain Monte Carlo 
strategies for obtaining twelve 
samples in a fixed amount of 
computer time. Time is 
represented by horizontal lines; 
samples by white circles. (1) A 
single run consisting of one long 
'burn in' period followed by a 
sampling period. (2) Four 
medium-length runs with different 
initial conditions and a 
medium- length burn in period. 
(3) Twelve short runs. 



1. Make one long run, obtaining all 7? samples from it. 

2. Make a few medium-length runs with different initial conditions, obtain- 
ing some samples from each. 

3. Make R short runs, each starting from a different random initial condi- 
tion, with the only state that is recorded being the final state of each 
simulation. 



The first strategy has the best chance of attaining 'convergence'. The last 
strategy may have the advantage that the correlations between the recorded 
samples are smaller. The middle path is popular with Markov chain Monte 
Carlo experts (Gilks et al, 1996) because it avoids the inefficiency of discarding 
burn-in iterations in many runs, while still allowing one to detect problems 
with lack of convergence that would not be apparent from a single run. 

Finally, I should emphasize that there is no need to make the points in 
the estimate nearly- independent. Averaging over dependent points is fine - it 
won't lead to any bias in the estimates. For example, when you use strategy 
1 or 2, you may, if you wish, include all the points between the first and last 
sample in each run. Of course, estimating the accuracy of the estimate is 
harder when the points are dependent. 



► 29.10 Summary 

• Monte Carlo methods are a powerful tool that allow one to sample from 
any probability distribution that can be expressed in the form P(x) = 

• Monte Carlo methods can answer virtually any query related to -P(x) by 
putting the query in the form 



/ 0(x)P(x)^±£>(xM). 
J r 



(29.49) 
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• In high-dimensional problems the only satisfactory methods are those 
based on Markov chains, such as the Metropolis method, Gibbs sam- 
pling and slice sampling. Gibbs sampling is an attractive method be- 
cause it has no adjustable parameters but its use is restricted to cases 
where samples can be generated from the conditional distributions. Slice 
sampling is attractive because, whilst it has step-length parameters, its 
performance is not very sensitive to their values. 

• Simple Metropolis algorithms and Gibbs sampling algorithms, although 
widely used, perform poorly because they explore the space by a slow 
random walk. The next chapter will discuss methods for speeding up 
Markov chain Monte Carlo simulations. 

• Slice sampling does not avoid random walk behaviour, but it automat- 
ically chooses the largest appropriate step size, thus reducing the bad 
effects of the random walk compared with, say, a Metropolis method 
with a tiny step size. 

► 29.11 Exercises 

Exercise 29.13. ^ ZC ' P- 386 l A study of importance sampling. We already estab- 
lished in section 29.2 that importance sampling is likely to be useless in 
high-dimensional problems. This exercise explores a further cautionary 
tale, showing that importance sampling can fail even in one dimension, 
even with friendly Gaussian distributions. 

Imagine that we want to know the expectation of a function 4>(x) under 
a distribution P(x), 

$ = J dx P(x)4>(x), (29.50) 

and that this expectation is estimated by importance sampling with 
a distribution Q(x). Alternatively, perhaps we wish to estimate the 
normalizing constant Z in P(x) = P*(x)/Z using 

Now, let P{x) and Q(x) be Gaussian distributions with mean zero and 
standard deviations a p and a q . Each point x drawn from Q will have 
an associated weight P*(x)/Q(x). What is the variance of the weights? 
[Assume that P* = P, so P is actually normalized, and Z = 1, though 
we can pretend that we didn't know that.] What happens to the variance 
of the weights as — > <Jp/2? 

Check your theory by simulating this importance-sampling problem on 
a computer. 

Exercise 29.14. ^ Consider the Metropolis algorithm for the one-dimensional 
toy problem of section 29.4, sampling from {0, 1, . . . , 20}. Whenever 
the current state is one of the end states, the proposal density given in 
equation (29.34) will propose with probability 50% a state that will be 
rejected. 

To reduce this 'waste', Fred modifies the software responsible for gen- 
erating samples from Q so that when x = 0, the proposal density is 
100% on x' = 1, and similarly when x = 20, x' = 19 is always proposed. 
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Fred sets the software that implements the acceptance rule so that the 
software accepts all proposed moves. What probability P'(x) will Fred's 
modified software generate samples from? 

What is the correct acceptance rule for Fred's proposal density, in order 
to obtain samples from P{x)l 

> Exercise 29.15. Implement Gibbs sampling for the inference of a single 
one-dimensional Gaussian, which we studied using maximum likelihood 
in section 22.1. Assign a broad Gaussian prior to fi and a broad gamma 
prior (24.2) to the precision parameter (3 = 1/cr 2 . Each update of /j will 
involve a sample from a Gaussian distribution, and each update of a 
requires a sample from a gamma distribution. 



for the inference of a mixture of K one-dimensional Gaussians, which we 
studied using maximum likelihood in section 22.2. Allow the clusters to 
have different standard deviations o^. Assign priors to the means and 
standard deviations in the same way as the previous exercise. Either fix 
the prior probabilities of the classes {irk} to be equal or put a uniform 
prior over the parameters 7r and include them in the Gibbs sampling. 

Notice the similarity of Gibbs sampling to the soft K-means clustering 
algorithm (algorithm 22.2). We can alternately assign the class labels 
{k n } given the parameters {nki^k}, then update the parameters given 
the class labels. The assignment step involves sampling from the proba- 
bility distributions defined by the responsibilities (22.22), and the update 
step updates the means and variances using probability distributions 
centred on the K-means algorithm's values (22.23, 22.24). 

Do your experiments confirm that Monte Carlo methods bypass the over- 
fitting difficulties of maximum likelihood discussed in section 22.4? 

A solution to this exercise and the previous one, written in octave, is 
available. 2 

> Exercise 29.17. Implement Gibbs sampling for the seven scientists inference 

problem, which we encountered in exercise 22.15 (p. 309), and which you 
may have solved by exact marginalization (exercise 24.3 (p. 323)) [it's 
not essential to have done the latter]. 

> Exercise 29.18.^ A Metropolis method is used to explore a distribution P(x) 

that is actually a 1000-dimensional spherical Gaussian distribution of 
standard deviation 1 in all dimensions. The proposal density Q is a 
1000-dimensional spherical Gaussian distribution of standard deviation 
e. Roughly what is the step size e if the acceptance rate is 0.5? Assuming 
this value of e, 

(a) roughly how long would the method take to traverse the distribution 
and generate a sample independent of the initial condition? 

(b) By how much does lnP(x) change in a typical step? By how much 
should lnP(x) vary when x is drawn from P(x)? 

(c) What happens if, rather than using a Metropolis method that tries 
to change all components at once, one instead uses a concatenation 
of Metropolis updates changing one component at a time? 

2 http : //www. inference .phy . cam. ac . uk/mackay/ itila/ 




Implement Gibbs sampling 
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> Exercise 29.19. ^ When discussing the time taken by the Metropolis algo- 
rithm to generate independent samples we considered a distribution with 
longest spatial length scale L being explored using a proposal distribu- 
tion with step size e. Another dimension that a MCMC method must 
explore is the range of possible values of the log probability lnP*(x). 
Assuming that the state x contains a number of independent random 
variables proportional to N, when samples are drawn from P(x), the 
'asymptotic equipartition' principle tell us that the value of — lnP(x) is 
likely to be close to the entropy of x, varying either side with a standard 
deviation that scales as v^V- Consider a Metropolis method with a sym- 
metrical proposal density, that is, one that satisfies Q(x;x') = Q(x';x). 
Assuming that accepted jumps either increase lnP*(x) by some amount 
or decrease it by a small amount, e.g. lne = 1 (is this a reasonable 
assumption?), discuss how long it must take to generate roughly inde- 
pendent samples from P(x) . Discuss whether Gibbs sampling has similar 
properties. 

Exercise 29.20. ^ Markov chain Monte Carlo methods do not compute parti- 
tion functions Z, yet they allow ratios of quantities like Z to be esti- 
mated. For example, consider a random-walk Metropolis algorithm in a 
state space where the energy is zero in a connected accessible region, and 
infinitely large everywhere else; and imagine that the accessible space can 
be chopped into two regions connected by one or more corridor states. 
The fraction of times spent in each region at equilibrium is proportional 
to the volume of the region. How does the Monte Carlo method manage 
to do this without measuring the volumes? 

Exercise 29.21. ^ Philosophy. 

One curious defect of these Monte Carlo methods - which are widely used 
by Bayesian statisticians - is that they are all non-Bayesian (O'Hagan, 
1987). They involve computer experiments from which estimators of 
quantities of interest are derived. These estimators depend on the pro- 
posal distributions that were used to generate the samples and on the 
random numbers that happened to come out of our random number 
generator. In contrast, an alternative Bayesian approach to the problem 
would use the results of our computer experiments to infer the proper- 
ties of the target function -P(x) and generate predictive distributions for 
quantities of interest such as This approach would give answers that 
would depend only on the computed values of P*(x( r )) at the points 
{xW}; the answers would not depend on how those points were chosen. 

Can you make a Bayesian Monte Carlo method? (See Rasmussen and 
Ghahramani (2003) for a practical attempt.) 

► 29.12 Solutions 

Solution to exercise 29.1 (p. 362). We wish to show that 

ft EE Er"V*(* (r) ) (29 . 52) 
Er W r 

converges to the expectation of ft under P. We consider the numerator and the 
denominator separately. First, the denominator. Consider a single importance 
weight 

P*( x (r)) 

^qWy (29 - 53) 
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29.12: Solutions 

What is its expectation, averaged under the distribution Q = Q*/Zq of the 
point x^ r >1 

K> = / axq{x) 9$) = / dx ^ p * {x) = %■ (29 - 54) 

So the expectation of the denominator is 




(29.55) 



As long as the variance of w r is finite, the denominator, divided by R, will 
converge to Zp/Zq as R increases. [In fact, the estimate converges to the 
right answer even if this variance is infinite, as long as the expectation is 
well-defined.] Similarly, the expectation of one term in the numerator is 

(w r <f>(x)) = J dxQ(x)^^^x) = J dx-^-P*(x)</>(x) = (29.56) 

where $ is the expectation of <j) under P. So the numerator, divided by R, 
converges to with increasing R. Thus <l> converges to $. 

The numerator and the denominator are unbiased estimators of RZp/Zq 
and RZp/Zq<& respectively, but their ratio $ is not necessarily an unbiased 
estimator for finite R. 

Solution to exercise 29.2 (p. 363). When the true density P is multimodal, it is 
unwise to use importance sampling with a sampler density fitted to one mode, 
because on the rare occasions that a point is produced that lands in one of 
the other modes, the weight associated with that point will be enormous. The 
estimates will have enormous variance, but this enormous variance may not 
be evident to the user if no points in the other mode have been seen. 

Solution to exercise 29.5 (p. 371). The posterior distribution for the syndrome 
decoding problem is a pathological distribution from the point of view of Gibbs 
sampling. The factor l[Hn = z] is only 1 on a small fraction of the space of 
possible vectors n, namely the 2 points that correspond to the valid code- 
words. No two codewords are adjacent, so similarly, any single bit flip from 
a viable state n will take us to a state with zero probability and so the state 
will never move in Gibbs sampling. 

A general code has exactly the same problem. The points corresponding 
to valid codewords are relatively few in number and they are not adjacent (at 
least for any useful code). So Gibbs sampling is no use for syndrome decoding 
for two reasons. First, finding any reasonably good hypothesis is difficult, and 
as long as the state is not near a valid codeword, Gibbs sampling cannot help 
since none of the conditional distributions is defined; and second, once we are 
in a valid hypothesis, Gibbs sampling will never take us out of it. 

One could attempt to perform Gibbs sampling using the bits of the original 
message s as the variables. This approach would not get locked up in the way 
just described, but, for a good code, any single bit flip would substantially 
alter the reconstructed codeword, so if one had found a state with reasonably 
large likelihood, Gibbs sampling would take an impractically large time to 
escape from it. 

Solution to exercise 29.12 (p. 380). Each Metropolis proposal will take the 
energy of the state up or down by some amount. The total change in energy 
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when B proposals are concatenated will be the end-point of a random walk 
with B steps in it. This walk might have mean zero, or it might have a 
tendency to drift upwards (if most moves increase the energy and only a few 
decrease it). In general the latter will hold, if the acceptance rate / is small: 
the mean change in energy from any one move will be some AE > and so 
the acceptance probability for the concatenation of B moves will be of order 
1/(1 + exp(— BAEj), which scales roughly as / . The mean-square-distance 
moved will be of order f B Be 2 , where e is the typical step size. In contrast, 
the mean-square-distance moved when the moves are considered individually 
will be of order fBe 2 . 
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Solution to exercise 29.13 (p. 382). The weights are w 
drawn from Q. The mean weight is 



P(x)/Q(x) and x is 



J dx Q(x) [P(x)/Q(x)} = J dx P(x) = 1, (29.57) 



assuming the integral converges. The variance is 



var(w) = J dx Q(x) 
P(x)' 



P(x) 



I 



dx 



Q(x) 



[Q(x) 
- - 2P(x) + Q{x) 



dx — y exp 
Z p 



x 

y 



(29.58) 
(29.59) 
(29.60) 



where Zq/Z p = a q j{\J2Txa^). The integral in (29.60) is finite only if the 
coefficient of x 1 in the exponent is positive, i.e., if 



Figure 29.20. Importance 
sampling in one dimension. For 
R = 1000, 10 4 , and 10 5 , the 
normalizing constant of a 
Gaussian distribution (known in 
fact to be 1) was estimated using 
importance sampling with a 
sampler density of standard 
deviation <j q (horizontal axis). 
The same random number seed 
was used for all runs. The three 
plots show (a) the estimated 
normalizing constant; (b) the 
empirical standard deviation of 
the R weights; (c) 30 of the 
weights. 



°9 > 2<V 



If this condition is satisfied, the variance is 

varH = 7^1 



, 2 1 

27T [ — — — tt 



r 2\V2 



(29.61) 



1. (29.62) 



As a q approaches the critical value - about 0.7a p - the variance becomes 
infinite. Figure 29.20 illustrates these phenomena for a p = 1 with a q varying 
from 0.1 to 1.5. The same random number seed was used for all runs, so 
the weights and estimates follow smooth curves. Notice that the empirical 
standard deviation of the R weights can look quite small and well-behaved 
(say, at a q ~ 0.3) when the true standard deviation is nevertheless infinite. 
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Efficient Monte Carlo Methods 



This chapter discusses several methods for reducing random walk behaviour 
in Metropolis methods. The aim is to reduce the time required to obtain 
effectively independent samples. For brevity, we will say 'independent samples' 
when we mean 'effectively independent samples'. 

► 30.1 Hamiltonian Monte Carlo 

The Hamiltonian Monte Carlo method is a Metropolis method, applicable 
to continuous state spaces, that makes use of gradient information to reduce 
random walk behaviour. [The Hamiltonian Monte Carlo method was originally 
called hybrid Monte Carlo, for historical reasons.] 

For many systems whose probability -P(x) can be written in the form 

P(x) = L-^-, (30.1) 

not only P(x) but also its gradient with respect to x can be readily evaluated. 
It seems wasteful to use a simple random-walk Metropolis method when this 
gradient is available - the gradient indicates which direction one should go in 
to find states that have higher probability! 

Overview of Hamiltonian Monte Carlo 

In the Hamiltonian Monte Carlo method, the state space x is augmented by 
momentum variables p, and there is an alternation of two types of proposal. 
The first proposal randomizes the momentum variable, leaving the state x un- 
changed. The second proposal changes both x and p using simulated Hamil- 
tonian dynamics as defined by the Hamiltonian 

H(x,p) = E(x) + K(p), (30.2) 

where K(p) is a 'kinetic energy' such as K (p) = p T p/2. These two proposals 
are used to create (asymptotically) samples from the joint density 

Pjj(x,p) = J-exp[-.ff(x,p)] = ^exp[-£(x)]exp[-K(p)]. (30.3) 

This density is separable, so the marginal distribution of x is the desired 
distribution exp[— E(yi)]/Z. So, simply discarding the momentum variables, 
we obtain a sequence of samples {x^} that asymptotically come from P(x). 
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% = gradE ( x ) ; 


# 


set gradient using initial x 


T? j; ■ j r ( \ 

h - imdh (. x ; ; 


# 


set objective function too 


for 1 = 1:L 


# 


loop L times 


p = randn ( size(x) ) ; 


# 


initial momentum is Normal(0,l) 


H = p' *p/2 + E; 


# 


evaluate H(x,p) 


xnew = x ; gnew = g ; 






for tau = l:Tau 


# 


make Tau 'leapfrog' steps 


p = p - epsilon * gnew 


/ 2 ; # 


make half-step in p 


xnew = xnew + epsilon * 


p ; # 


make step in x 


gnew = gradE ( xnew ) ; 


# 


find new gradient 


p = p - epsilon * gnew 


/ 2 ; # 


make half-step in p 


endf or 






Knew = finHF ( xnpu 1 


# 


find new value of H 


Hnew = p' * p / 2 + Enew ; 






dH = Hnew - H ; 


# 


Decide whether to accept 


if ( dH < ) 


accept = 1 ; 


elseif ( randO < exp(-dH) 


) accept = 1 ; 


else 


accept = ; 


endif 






if ( accept ) 






g = gnew ; x = xnew ; 


E = 


Enew ; 


endif 






endf or 







Algorithm 30.1. Octave source 
code for the Hamiltonian Monte 
Carlo method. 



Hamiltonian Monte Carlo Simple Metropolis Figure 30.2. (a,b) Hamiltonian 




Monte Carlo used to generate 
samples from a bivariate Gaussian 
with correlation p = 0.998. (c,d) 
For comparison, a simple 
random-walk Metropolis method, 
given equal computer time. 



-1 -0.5 0.5 1 -1 -0.5 0.5 1 

I ' I 1 r ~^l 1 P 1 1 1 T\ 
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30.1: Hamiltonian Monte Carlo 



Details of Hamiltonian Monte Carlo 

The first proposal, which can be viewed as a Gibbs sampling update, draws a 
new momentum from the Gaussian density exp[— K(p)]/Zk- This proposal is 
always accepted. During the second, dynamical proposal, the momentum vari- 
able determines where the state x goes, and the gradient of -E(x) determines 
how the momentum p changes, in accordance with the equations 

x = p (30.4) 
dE(x) 



<9x 



(30.5) 



Because of the persistent motion of x in the direction of the momentum p 
during each dynamical proposal, the state of the system tends to move a 
distance that goes linearly with the computer time, rather than as the square 
root. 

The second proposal is accepted in accordance with the Metropolis rule. 
If the simulation of the Hamiltonian dynamics is numerically perfect then 
the proposals are accepted every time, because the total energy H (x, p) is a 
constant of the motion and so a in equation (29.31) is equal to one. If the 
simulation is imperfect, because of finite step sizes for example, then some of 
the dynamical proposals will be rejected. The rejection rule makes use of the 
change in H(x, p), which is zero if the simulation is perfect. The occasional 
rejections ensure that, asymptotically, we obtain samples (x''',p('') from the 
required joint density -Pff(x, p). 

The source code in figure 30.1 describes a Hamiltonian Monte Carlo method 
that uses the 'leapfrog' algorithm to simulate the dynamics on the function 
findE(x), whose gradient is found by the function gradE(x). Figure 30.2 
shows this algorithm generating samples from a bivariate Gaussian whose en- 
ergy function is E(x) = ^x T Ax with 



A = 



250.25 -249.75 
-249.75 250.25 



(30.6) 



corresponding to a variance-covariance matrix of 

(30.7) 



1 0.998 
0.998 1 



In figure 30.2a, starting from the state marked by the arrow, the solid line 
represents two successive trajectories generated by the Hamiltonian dynamics. 
The squares show the endpoints of these two trajectories. Each trajectory 
consists of Tau = 19 'leapfrog' steps with epsilon = 0.055. These steps are 
indicated by the crosses on the trajectory in the magnified inset. After each 
trajectory, the momentum is randomized. Here, both trajectories are accepted; 
the errors in the Hamiltonian were only +0.016 and —0.06 respectively. 

Figure 30.2b shows how a sequence of four trajectories converges from an 
initial condition, indicated by the arrow, that is not close to the typical set 
of the target distribution. The trajectory parameters Tau and epsilon were 
randomized for each trajectory using uniform distributions with means 19 and 
0.055 respectively. The first trajectory takes us to a new state, (—1.5,-0.5), 
similar in energy to the first state. The second trajectory happens to end in 
a state nearer the bottom of the energy landscape. Here, since the potential 
energy E is smaller, the kinetic energy K = p 2 /2 is necessarily larger than it 
was at the start of the trajectory. When the momentum is randomized before 
the third trajectory, its kinetic energy becomes much smaller. After the fourth 
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Gibbs sampling 



Over/relaxation 



(a) 
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(b) 



(c) 
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Figure 30.3. Overrelaxation 
contrasted with Gibbs sampling 
for a bivariatc Gaussian with 
correlation p = 0.998. (a) The 
state sequence for 40 iterations, 
each iteration involving one 
update of both variables. The 
overrelaxation method had 
a = —0.98. (This excessively large 
value is chosen to make it easy to 
see how the overrelaxation method 
reduces random walk behaviour.) 
The dotted line shows the contour 
x T S — *x = 1. (b) Detail of (a), 
showing the two steps making up 
each iteration, (c) Time-course of 
the variable X\ during 2000 
iterations of the two methods. 
The overrelaxation method had 
a = -0.89. (After Neal (1995).) 



200 400 600 800 1000 1200 1400 1600 1800 2000 



trajectory has been simulated, the state appears to have become typical of the 
target density. 

Figures 30.2(c) and (d) show a random-walk Metropolis method using a 
Gaussian proposal density to sample from the same Gaussian distribution, 
starting from the initial conditions of (a) and (b) respectively. In (c) the step 
size was adjusted such that the acceptance rate was 58%. The number of 
proposals was 38 so the total amount of computer time used was similar to 
that in (a). The distance moved is small because of random walk behaviour. 
In (d) the random-walk Metropolis method was used and started from the 
same initial condition as (b) and given a similar amount of computer time. 

► 30.2 Overrelaxation 

The method of overrelaxation is a method for reducing random walk behaviour 
in Gibbs sampling. Overrelaxation was originally introduced for systems in 
which all the conditional distributions are Gaussian. 

An example of a joint distribution that is not Gaussian but whose conditional 
distributions are all Gaussian is P(x,y) = cxp(—x 2 y 2 — x 2 — y 2 )/Z. 
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30.2: Overrelaxation 

Overtaxation for Gaussian conditional distributions 

In ordinary Gibbs sampling, one draws the new value xf +1 ^ of the current 
variable X{ from its conditional distribution, ignoring the old value xf . The 
state makes lengthy random walks in cases where the variables are strongly 
correlated, as illustrated in the left-hand panel of figure 30.3. This figure uses 
a correlated Gaussian distribution as the target density. 

In Adler's (1981) overrelaxation method, one instead samples xf from 
a Gaussian that is biased to the opposite side of the conditional distribution. 
If the conditional distribution of Xi is Normal(^, a 2 ) and the current value of 
IS X ■ , then Adler's method sets Xi to 

xf +1) = u + a{xf ] -u) + (l- a 2 ) 1 ' 2 ™, (30.8) 

where v ~ Normal(0, 1) and a is a parameter between —1 and 1, usually set to 
a negative value. (If a is positive, then the method is called under-relaxation.) 

!<| \t Exercise 30.1.^ Show that this individual transition leaves invariant the con- 
ditional distribution Xi ~ Normal(^, <r 2 ). 

A single iteration of Adler's overrelaxation, like one of Gibbs sampling, updates 
each variable in turn as indicated in equation (30.8). The transition matrix 
T(x'; x) defined by a complete update of all variables in some fixed order does 
not satisfy detailed balance. Each individual transition for one coordinate 
just described does satisfy detailed balance - so the overall chain gives a valid 
sampling strategy which converges to the target density P(x) - but when we 
form a chain by applying the individual transitions in a fixed sequence, the 
overall chain is not reversible. This temporal asymmetry is the key to why 
overrelaxation can be beneficial. If, say, two variables are positively correlated, 
then they will (on a short timescale) evolve in a directed manner instead of by 
random walk, as shown in figure 30.3. This may significantly reduce the time 
required to obtain independent samples. 

Exercise 30.2. ^ The transition matrix T(x';x) defined by a complete update 
of all variables in some fixed order does not satisfy detailed balance. If 
the updates were in a random order, then T would be symmetric. Inves- 
tigate, for the toy two-dimensional Gaussian distribution, the assertion 
that the advantages of overrelaxation are lost if the overrelaxed updates 
are made in a random order. 

Ordered Overrelaxation 

The overrelaxation method has been generalized by Neal (1995) whose ordered 
overrelaxation method is applicable to any system where Gibbs sampling is 
used. In ordered overrelaxation, instead of taking one sample from the condi- 
tional distribution P(xi \ {xj}j^i), we create K such samples xf~\x\ 2 \ . . . , x\ K \ 
where K might be set to twenty or so. Often, generating K — 1 extra samples 
adds a negligible computational cost to the initial computations required for 
making the first sample. The points {xf ^} are then sorted numerically, and 
the current value of X{ is inserted into the sorted list, giving a list of K + 1 
points. We give them ranks 0,1,2, ... ,K. Let k be the rank of the current 
value of Xi in the list. We set x\ to the value that is an equal distance from 
the other end of the list, that is, the value with rank K — k. The role played 
by Adler's a parameter is here played by the parameter K. When K = 1, we 
obtain ordinary Gibbs sampling. For practical purposes Neal estimates that 
ordered overrelaxation may speed up a simulation by a factor of ten or twenty. 
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► 30.3 Simulated annealing 

A third technique for speeding convergence is simulated annealing. In simu- 
lated annealing, a 'temperature' parameter is introduced which, when large, 
allows the system to make transitions that would be improbable at temper- 
ature 1. The temperature is set to a large value and gradually reduced to 
1. This procedure is supposed to reduce the chance that the simulation gets 
stuck in an unrepresentative probability island. 

We asssume that we wish to sample from a distribution of the form 

e -B(x) 

P(x) = — ^- (30.9) 

where -E(x) can be evaluated. In the simplest simulated annealing method, 
we instead sample from the distribution 

P T (x) = ^ Fy e T (30.10) 

and decrease T gradually to 1. 

Often the energy function can be separated into two terms, 

E(x)=E (x) + E 1 (x), (30.11) 

of which the first term is 'nice' (for example, a separable function of x) and the 
second is 'nasty'. In these cases, a better simulated annealing method might 
make use of the distribution 

P^) = ^ m e- E ^- El{ * ) l T (30.12) 

with T gradually decreasing to 1. In this way, the distribution at high tem- 
peratures reverts to a well-behaved distribution defined by Eq. 

Simulated annealing is often used as an optimization method, where the 
aim is to find an x that minimizes -E(x), in which case the temperature is 
decreased to zero rather than to 1. 

As a Monte Carlo method, simulated annealing as described above doesn't 
sample exactly from the right distribution, because there is no guarantee that 
the probability of falling into one basin of the energy is equal to the total prob- 
ability of all the states in that basin. The closely related 'simulated tempering' 
method (Marinari and Parisi, 1992) corrects the biases introduced by the an- 
nealing process by making the temperature itself a random variable that is 
updated in Metropolis fashion during the simulation. Neal's (1998) 'annealed 
importance sampling' method removes the biases introduced by annealing by 
computing importance weights for each generated point. 



► 30.4 Skilling's multi-state leapfrog method 

A fourth method for speeding up Monte Carlo simulations, due to John 
Skilling, has a similar spirit to overrelaxation, but works in more dimensions. 
This method is applicable to sampling from a distribution over a continuous 
state space, and the sole requirement is that the energy E(x) should be easy 
to evaluate. The gradient is not used. This leapfrog method is not intended to 
be used on its own but rather in sequence with other Monte Carlo operators. 

Instead of moving just one state vector x around the state space, as was 
the case for all the Monte Carlo methods discussed thus far, Skilling's leapfrog 
method simultaneously maintains a set of S state vectors {x^}, where S 
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30.4: Skilling's multi-state leapfrog method 

might be six or twelve. The aim is that all S of these vectors will represent 
independent samples from the same distribution -P(x). 

Skilling's leapfrog makes a proposal for the new state x^' , which is ac- 
cepted or rejected in accordance with the Metropolis method, by leapfrogging 
the current state x' s ' over another state vector x' 4 ': 

X W = x (t) + ( X W _ x «) = 2x « - X W. (30.13) 

All the other state vectors are left where they are, so the acceptance probability x (*) # 
depends only on the change in energy of x^ s \ 

Which vector, t, is the partner for the leapfrog event can be chosen in 
various ways. The simplest method is to select the partner at random from xM» 
the other vectors. It might be better to choose t by selecting one of the 
nearest neighbours x^' - nearest by any chosen distance function - as long 
as one then uses an acceptance rule that ensures detailed balance by checking 
whether point t is still among the nearest neighbours of the new point, x' s ' . 

Why the leapfrog is a good idea 

Imagine that the target density P(x) has strong correlations - for example, 
the density might be a needle-like Gaussian with width e and length Le, where 
L 3> 1. As we have emphasized, motion around such a density by standard 
methods proceeds by a slow random walk. 

Imagine now that our set of S points is lurking initially in a location that 
is probable under the density, but in an inappropriately small ball of size e. 
Now, under Skilling's leapfrog method, a typical first move will take the point 
a little outside the current ball, perhaps doubling its distance from the centre 
of the ball. After all the points have had a chance to move, the ball will have 
increased in size; if all the moves are accepted, the ball will be bigger by a 
factor of two or so in all dimensions. The rejection of some moves will mean 
that the ball containing the points will probably have elongated in the needle's 
long direction by a factor of, say, two. After another cycle through the points, 
the ball will have grown in the long direction by another factor of two. So the 
typical distance travelled in the long dimension grows exponentially with the 
number of iterations. 

Now, maybe a factor of two growth per iteration is on the optimistic side; 
but even if the ball only grows by a factor of, let's say, 1.1 per iteration, the 
growth is nevertheless exponential. It will only take a number of iterations 
proportional to log Lj log(l.l) for the long dimension to be explored. 

> Exercise 30.3. ^' p - 398 l Discuss how the effectiveness of Skilling's method scales 
with dimensionality, using a correlated A r -dimensional Gaussian distri- 
bution as an example. Find an expression for the rejection probability, 
assuming the Markov chain is at equilibrium. Also discuss how it scales 
with the strength of correlation among the Gaussian variables. [Hint: 
Skilling's method is invariant under affine transformations, so the rejec- 
tion probability at equilibrium can be found by looking at the case of a 
separable Gaussian.] 

This method has some similarity to the 'adaptive direction sampling' method 
of Gilks et al. (1994) but the leapfrog method is simpler and can be applied 
to a greater variety of distributions. 
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► 30.5 Monte Carlo algorithms as communication channels 

It may be a helpful perspective, when thinking about speeding up Monte Carlo 
methods, to think about the information that is being communicated. Two 
communications take place when a sample from P(x) is being generated. 

First, the selection of a particular x from P(x) necessarily requires that 
at least log 1/P(x) random bits be consumed. [Recall the use of inverse arith- 
metic coding as a method for generating samples from given distributions 
(section 6.3).] 

Second, the generation of a sample conveys information about -P(x) from 
the subroutine that is able to evaluate P*(x) (and from any other subroutines 
that have access to properties of P*(x)). 

Consider a dumb Metropolis method, for example. In a dumb Metropolis 
method, the proposals Q(x';x) have nothing to do with P(x). Properties 
of P(x) are only involved in the algorithm at the acceptance step, when the 
ratio P*(x')/P*(x) is computed. The channel from the true distribution P(x) 
to the user who is interested in computing properties of P(x) thus passes 
through a bottleneck: all the information about P is conveyed by the string of 
acceptances and rejections. If P(x) were replaced by a different distribution 
P2(x), the only way in which this change would have an influence is that the 
string of acceptances and rejections would be changed. I am not aware of much 
use being made of this information-theoretic view of Monte Carlo algorithms, 
but I think it is an instructive viewpoint: if the aim is to obtain information 
about properties of P(x) then presumably it is helpful to identify the channel 
through which this information flows, and maximize the rate of information 
transfer. 

Example 30.4. The information-theoretic viewpoint offers a simple justification 
for the widely- adopted rule of thumb, which states that the parameters of 
a dumb Metropolis method should be adjusted such that the acceptance 
rate is about one half. Let's call the acceptance history, that is, the 
binary string of accept or reject decisions, a. The information learned 
about P(x) after the algorithm has run for T steps is less than or equal to 
the information content of a, since all information about P is mediated 
by a. And the information content of a is upper-bounded by TiZ^C/); 
where / is the acceptance rate. This bound on information acquired 
about P is maximized by setting / = 1/2. 

Another helpful analogy for a dumb Metropolis method is an evolutionary 
one. Each proposal generates a progeny x' from the current state x. These two 
individuals then compete with each other, and the Metropolis method uses a 
noisy survival-of- the- fittest rule. If the progeny x' is fitter than the parent (i.e., 
P*(x') > P*(x), assuming the Q/Q factor is unity) then the progeny replaces 
the parent. The survival rule also allows less- fit progeny to replace the parent, 
sometimes. Insights about the rate of evolution can thus be applied to Monte 
Carlo methods. 

Exercise 30.5. Let x e {0, 1} G and let P(x) be a separable distribution, 

P(x) = Y[p(x g ), (30.14) 

9 

with p(0) = po and p(l) = p\, for example p\ = 0.1. Let the proposal 
density of a dumb Metropolis algorithm Q involve flipping a fraction m 
of the G bits in the state x. Analyze how long it takes for the chain to 
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30.6: Multi-state methods 

converge to the target density as a function of m. Find the optimal m 
and deduce how long the Metropolis method must run for. 

Compare the result with the results for an evolving population under 
natural selection found in Chapter 19. 

The insight that the fastest progress that a standard Metropolis method 
can make, in information terms, is about one bit per iteration, gives a strong 
motivation for speeding up the algorithm. This chapter has already reviewed 
several methods for reducing random-walk behaviour. Do these methods also 
speed up the rate at which information is acquired? 

Exercise 30.6. ^ Does Gibbs sampling, which is a smart Metropolis method 
whose proposal distributions do depend on -P(x), allow information about 
P(x) to leak out at a rate faster than one bit per iteration? Find toy 
examples in which this question can be precisely investigated. 

Exercise 30.7. ^ Hamiltonian Monte Carlo is another smart Metropolis method 
in which the proposal distributions depend on -P(x). Can Hamiltonian 
Monte Carlo extract information about P(x) at a rate faster than one 
bit per iteration? 

Exercise 30.8. ^ In importance sampling, the weight w r = i- > *(x( r ))/(3*(xM), 
a floating-point number, is computed and retained until the end of the 
computation. In contrast, in the dumb Metropolis method, the ratio 
a = P*(x')/P*(x) is reduced to a single bit ('is o bigger than or smaller 
than the random number uT). Thus in principle importance sampling 
preserves more information about P* than does dumb Metropolis. Can 
you find a toy example in which this extra information does indeed lead 
to faster convergence of importance sampling than Metropolis? Can 
you design a Markov chain Monte Carlo algorithm that moves around 
adaptively, like a Metropolis method, and that retains more useful in- 
formation about the value of P* , like importance sampling? 

In Chapter 19 we noticed that an evolving population of N individuals can 
make faster evolutionary progress if the individuals engage in sexual reproduc- 
tion. This observation motivates looking at Monte Carlo algorithms in which 
multiple parameter vectors x are evolved and interact. 

► 30.6 Multi-state methods 

In a multi-state method, multiple parameter vectors x are maintained; they 
evolve individually under moves such as Metropolis and Gibbs; there are also 
interactions among the vectors. The intention is either that eventually all the 
vectors x should be samples from P(x) (as illustrated by Skilling's leapfrog 
method), or that information associated with the final vectors x should allow 
us to approximate expectations under P(x), as in importance sampling. 

Genetic methods 

Genetic algorithms are not often described by their proponents as Monte Carlo 
algorithms, but I think this is the correct categorization, and an ideal genetic 
algorithm would be one that can be proved to be a valid Monte Carlo algorithm 
that converges to a specified density. 
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I'll use R to denote the number of vectors in the population. We aim to 
have P*({x( r )}^) = Yi P*(^ r ^). A genetic algorithm involves moves of two or 
three types. 

First, individual moves in which one state vector is perturbed, 

X M _^ x (r) 

which could be performed using any of the Monte Carlo methods we have 
mentioned so far. 

Second, we allow crossover moves of the form x,y — > x',y'; in a typical 
crossover move, the progeny x' receives half his state vector from one parent, 
x, and half from the other, y; the secret of success in a genetic algorithm is 
that the parameter x must be encoded in such a way that the crossover of 
two independent states x and y, both of which have good fitness P* , should 
have a reasonably good chance of producing progeny who are equally fit. This 
constraint is a hard one to satisfy in many problems, which is why genetic 
algorithms are mainly talked about and hyped up, and rarely used by serious 
experts. Having introduced a crossover move x,y — > x',y', we need to choose 
an acceptance rule. One easy way to obtain a valid algorithm is to accept or 
reject the crossover proposal using the Metropolis rule with P*({x( r )}^) as 
the target density - this involves comparing the fitnesses before and after the 
crossover using the ratio 

P*(x)P*(y) • (30J5) 

If the crossover operator is reversible then we have an easy proof that this 
procedure satisfies detailed balance and so is a valid component in a chain 
converging to P*({xM}f ). 

> Exercise 30.9. ^ Discuss whether the above two operators, individual varia- 
tion and crossover with the Metropolis acceptance rule, will give a more 
efficient Monte Carlo method than a standard method with only one 
state vector and no crossover. 



The reason why the sexual community could acquire information faster than 
the asexual community in Chapter 19 was because the crossover operation 
produced diversity with standard deviation y/G, then the Blind Watchmaker 
was able to convey lots of information about the fitness function by killing 
off the less fit offspring. The above two operators do not offer a speed-up of 
yfG compared with standard Monte Carlo methods because there is no killing. 
What's required, in order to obtain a speed-up, is two things: multiplication 
and death; and at least one of these must operate selectively. Either we must 
kill off the less-fit state vectors, or we must allow the more-fit state vectors to 
give rise to more offspring. While it's easy to sketch these ideas, it is hard to 
define a valid method for doing it. 

[51 

Exercise 30.10. L J Design a birth rule and a death rule such that the chain 
converges to P*({x^}f). 

I believe this is still an open research problem. 



Particle filters 

Particle filters, which are particularly popular in inference problems involving 
temporal tracking, are multistate methods that mix the ideas of importance 
sampling and Markov chain Monte Carlo. See Isard and Blake (1996), Isard 
and Blake (1998), Berzuini et al. (1997), Berzuini and Gilks (2001), Doucet 
et al. (2001). 
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30. 7: Methods that do not necessarily help 

► 30.7 Methods that do not necessarily help 

It is common practice to use many initial conditions for a particular Markov 
chain (figure 29.19). If you are worried about sampling well from a complicated 
density -P(x), can you ensure the states produced by the simulations are well 
distributed about the typical set of P(x) by ensuring that the initial points 
are 'well distributed about the whole state space'? 

The answer is, unfortunately no. In hierarchical Bayesian models, for 
example, a large number of parameters {x n } may be coupled together via an- 
other parameter (3 (known as a hyperparameter) . For example, the quantities 
{x n } might be independent noise signals, and (3 might be the inverse-variance 
of the noise source. The joint distribution of (3 and {x n } might be 

N 

P((3,{x n }) = P{p)Y[P(Xn\0) 

n=l 

n=l 

where Z{(3) = \/2it/ ' (3 and P{(3) is a broad distribution describing our igno- 
rance about the noise level. For simplicity, let's leave out all the other variables 
- data and such - that might be involved in a realistic problem. Let's imagine 
that we want to sample effectively from P(j3, {x n }) by Gibbs sampling - alter- 
nately sampling (3 from the conditional distribution P{(3 \ x n ) then sampling all 
the x n from their conditional distributions P(x n \ (3). [The resulting marginal 
distribution of (3 should asymptotically be the broad distribution P(/3).] 

If N is large then the conditional distribution of (3 given any particular 
setting of {x n } will be tightly concentrated on a particular most-probable value 
of (3, with width proportional to 1/y/N. Progress up and down the /3-axis will 
therefore take place by a slow random walk with steps of size oc 1/y/N. 

So, to the initialization strategy. Can we finesse our slow convergence 
problem by using initial conditions located 'all over the state space'? Sadly, 
no. If we distribute the points {x n } widely, what we are actually doing is 
favouring an initial value of the noise level 1/(3 that is large. The random 
walk of the parameter (3 will thus tend, after the first drawing of (3 from 
P({3 | x n ), always to start off from one end of the /3-axis. 

Further reading 

The Hamiltonian Monte Carlo method (Duane et al, 1987) is reviewed in Neal 
(1993b). This excellent tome also reviews a huge range of other Monte Carlo 
methods, including the related topics of simulated annealing and free energy 
estimation. 

► 30.8 Further exercises 

Exercise 30.11. ^ An important detail of the Hamiltonian Monte Carlo method 
is that the simulation of the Hamiltonian dynamics, while it may be in- 
accurate, must be perfectly reversible, in the sense that if the initial con- 
dition (x, p) goes to (x',p'), then the same simulator must take (x', — p') 
to (x, — p), and the inaccurate dynamics must conserve state-space vol- 
ume. [The leapfrog method in algorithm 30.1 satisfies these rules.] 

Explain why these rules must be satisfied and create an example illus- 
trating the problems that arise if they are not. 
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Exercise 30.12. ^ A multi-state idea for slice sampling. Investigate the follow- 
ing multi-state method for slice sampling. As in Skilling's multi-state 
leapfrog method (section 30.4), maintain a set of S state vectors {x^}. 
Update one state vector xM by one-dimensional slice sampling in a di- 
rection y determined by picking two other state vectors x'"' and x^ 
at random and setting y 

= x («) _ X H_ 

Investigate this method on toy 
problems such as a highly- correlated multivariate Gaussian distribution. 
Bear in mind that if S — 1 is smaller than the number of dimensions 
N then this method will not be ergodic by itself, so it may need to be 
mixed with other methods. Are there classes of problems that are better 
solved by this slice-sampling method than by the standard methods for 
picking y such as cycling through the coordinate axes or picking u at 
random from a Gaussian distribution? 




,(«) 



► 30.9 Solutions 

Solution to exercise 30.3 (p. 393). Consider the spherical Gaussian distribution 
where all components have mean zero and variance 1. In one dimension, the 
nth, if leapfrogs over x„\ we obtain the proposed coordinate 

(4 1 ))' = 24 2 )-x«. (30.16) 

Assuming that xff and xff are Gaussian random variables from Normal(0, 1), 
(xn^)' is Gaussian from Normal(0, a 2 ), where a 1 = 2 2 + ( — l) 2 = 5. The change 
in energy contributed by this one dimension will be 

\ K 2) - ^ f - (4 1 ') 2 ] = 2(4 2) ) 2 - 2^ 2) 4 1) (30.17) 

so the typical change in energy is 2 ((si 2 ') 2 ) = 2. This positive change is bad 
news. In N dimensions, the typical change in energy when a leapfrog move is 
made, at equilibrium, is thus +2^. The probability of acceptance of the move 
scales as 

e- 2N . (30.18) 

This implies that Skilling's method, as described, is not effective in very high- 
dimensional problems - at least, not once convergence has occurred. Nev- 
ertheless it has the impressive advantage that its convergence properties are 
independent of the strength of correlations between the variables - a property 
that not even the Hamiltonian Monte Carlo and overrelaxation methods offer. 
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About Chapter 31 



Some of the neural network models that we will encounter are related to Ising 
models, which are idealized magnetic systems. It is not essential to understand 
the statistical physics of Ising models to understand these neural networks, but 
I hope you'll find them helpful. 

Ising models are also related to several other topics in this book. We will 
use exact tree-based computation methods like those introduced in Chapter 
25 to evaluate properties of interest in Ising models. Ising models offer crude 
models for binary images. And Ising models relate to two-dimensional con- 
strained channels (cf. Chapter 17): a two-dimensional bar-code in which a 
black dot may not be completely surrounded by black dots, and a white dot 
may not be completely surrounded by white dots, is similar to an antiferro- 
magnetic Ising model at low temperature. Evaluating the entropy of this Ising 
model is equivalent to evaluating the capacity of the constrained channel for 
conveying bits. 

If you would like to jog your memory on statistical physics and thermody- 
namics, you might find Appendix B helpful. I also recommend the book by 
Reif (1965). 
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31 



Ising Models 



An Ising model is an array of spins (e.g., atoms that can take states ±1) that 
are magnetically coupled to each other. If one spin is, say, in the +1 state 
then it is energetically favourable for its immediate neighbours to be in the 
same state, in the case of a ferromagnetic model, and in the opposite state, in 
the case of an antiferromagnet. In this chapter we discuss two computational 
techniques for studying Ising models. 

Let the state x of an Ising model with N spins be a vector in which each 
component x n takes values —1 or +1. If two spins m and n are neighbours we 
write (m, n) € AT . The coupling between neighbouring spins is J. We define 
Jmn = J if m and n are neighbours and J mn = otherwise. The energy of a 
state x is 



E(x;J,H) 



m,n n 



(31.1) 



where H is the applied field. If J > then the model is ferromagnetic, and 
if J < it is antiferromagnetic. We've included the factor of Vl because each 
pair is counted twice in the first sum, once as (m,n) and once as (n,m). At 
equilibrium at temperature T, the probability that the state is x is 

P(x 1 13, J, H) = Z[ ^ J H) exp[-/?£(x; J, H)] , (31.2) 
where f3 = l/k^T, is Boltzmann's constant, and 

Z(/3, J,H) = J2 exp[-/3£(x; J, H)} . (31.3) 

X 

Relevance of Ising models 

Ising models are relevant for three reasons. 

Ising models are important first as models of magnetic systems that have 
a phase transition. The theory of universality in statistical physics shows that 
all systems with the same dimension (here, two), and the same symmetries, 
have equivalent critical properties, i.e., the scaling laws shown by their phase 
transitions are identical. So by studying Ising models we can find out not only 
about magnetic phase transitions but also about phase transitions in many 
other systems. 

Second, if we generalize the energy function to 



£(x;J,h) 



2 

ra.n 



(31.4) 



where the couplings J mn and applied fields h n are not constant, we obtain 
a family of models known as 'spin glasses' to physicists, and as 'Hopfield 
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networks' or 'Boltzmann machines' to the neural network community. In some 
of these models, all spins are declared to be neighbours of each other, in which 
case physicists call the system an 'infinite-range' spin glass, and networkers 
call it a 'fully connected' network. 

Third, the Ising model is also useful as a statistical model in its own right. 

In this chapter we will study Ising models using two different computational 
techniques. 



Some remarkable relationships in statistical physics 

We would like to get as much information as possible out of our computations. 
Consider for example the heat capacity of a system, which is defined to be 

( 31 - 5 ) 

where 

^ = |E ex p(-/ 3£ ( x )) s ( x )- ( 31 - 6 ) 

X 

To work out the heat capacity of a system, we might naively guess that we have 
to increase the temperature and measure the energy change. Heat capacity, 
however, is intimately related to energy fluctuations at constant temperature. 
Let's start from the partition function, 

Z = ^exp(-/3£(x)). (31.7) 

X 

The mean energy is obtained by differentiation with respect to (3: 
dlnZ 



jY,-E(x)exp(-l3E(x)) = -E. (31i 



d[3 Z 

A further differentiation spits out the variance of the energy: 

= \ E^ x ) 2ex P("^( x )) " & = (E 2 ) ~ e2 = var (^ ( 31 - 9 ) 

" X 

But the heat capacity is also the derivative of E with respect to temperature: 
dE d dlnZ d 2 \nZ d(3 



dT dT 8(3 df3 2 dT 

So for any system at temperature T, 



-var(£)(-l/fc B T 2 ). (31.10) 



C=^-P = fc B /3 2 var(£). (31.11) 

Thus if we can observe the variance of the energy of a system at equilibrium, 
we can estimate its heat capacity. 

I find this an almost paradoxical relationship. Consider a system with 
a finite set of states, and imagine heating it up. At high temperature, all 
states will be equiprobable, so the mean energy will be essentially constant 
and the heat capacity will be essentially zero. But on the other hand, with 
all states being equiprobable, there will certainly be fluctuations in energy. 
So how can the heat capacity be related to the fluctuations? The answer is 
in the words 'essentially zero' above. The heat capacity is not quite zero at 
high temperature, it just tends to zero. And it tends to zero as ™^ , with 
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the quantity vai(E) tending to a constant at high temperatures. This 1/T 2 
behaviour of the heat capacity of finite systems at high temperatures is thus 
very general. 

The 1/T 2 factor can be viewed as an accident of history. If only tem- 
perature scales had been defined using (3 = j^-r^, then the definition of heat 
capacity would be 

C& = || = var(£), (31.12) 

and heat capacity and fluctuations would be identical quantities. 

> Exercise 31.1. ^ [We will call the entropy of a physical system S rather than 
H, while we are in a statistical physics chapter; we set &b = 1-] 

The entropy of a system whose states are x, at temperature T = 1/(3, is 



where 



(a) Show that 



S = £>(x)pnl/p(x)] (31.13) 



p(x) = ^exp[-/3£(x)]. (31.14) 



S = In Z{0) +/3E(P) (31.15) 
where E((3) is the mean energy of the system, 
(b) Show that 

dF 

S=-Qf> (31J6) 

where the free energy F = —kT ha. Z and kT = 1/(3. 

► 31.1 Ising models - Monte Carlo simulation 

In this section we study two-dimensional planar Ising models using a simple 
Gibbs-sampling method. Starting from some initial state, a spin n is selected 
at random, and the probability that it should be +1 given the state of the 
other spins and the temperature is computed, 

p ^ = i + eM-2p bn y ^ 

where (3 = 1/k^T and b n is the local field 

b n = Jxm + H. (31.18) 

[The factor of 2 appears in equation (31.17) because the two spin states are 
{+1,-1} rather than {+1,0}.] Spin n is set to +1 with that probability, 
and otherwise to —1; then the next spin to update is selected at random. 
After sufficiently many iterations, this procedure converges to the equilibrium 
distribution (31.2). An alternative to the Gibbs sampling formula (31.17) is 
the Metropolis algorithm, in which we consider the change in energy that 
results from flipping the chosen spin from its current state x n , 

AE = 2x n b n , (31.19) 

and adopt this change in configuration with probability 

P(acce P t; AE, 0) = { £ f ^ Q ° (31.20) 
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31.1: Ising models - Monte Carlo simulation 

This procedure has roughly double the probability of accepting energetically 
unfavourable moves, so may be a more efficient sampler - but at very low tem- 
peratures the relative merits of Gibbs sampling and the Metropolis algorithm 
may be subtle. 

Rectangular geometry 

I first simulated an Ising model with the rectangular geometry shown in fig- 
ure 31.1, and with periodic boundary conditions. A line between two spins 
indicates that they are neighbours. I set the external field H = and con- 
sidered the two cases J = ±1 which are a ferromagnet and antiferromagnet 
respectively. 

I started at a large temperature (T= 33, (3 = 0.03) and changed the temper- 
ature every / iterations, first decreasing it gradually to T = 0.1,/3=10, then 
increasing it gradually back to a large temperature again. This procedure 
gives a crude check on whether 'equilibrium has been reached' at each tem- 
perature; if not, we'd expect to see some hysteresis in the graphs we plot. It 
also gives an idea of the reproducibility of the results, if we assume that the two 
runs, with decreasing and increasing temperature, are effectively independent 
of each other. 

At each temperature I recorded the mean energy per spin and the standard 
deviation of the energy, and the mean square value of the magnetization m, 

m=^^x„. (31.21) 

n 

One tricky decision that has to be made is how soon to start taking these 
measurements after a new temperature has been established; it is difficult to 
detect 'equilibrium' - or even to give a clear definition of a system's being 'at 
equilibrium' ! [But in Chapter 32 we will see a solution to this problem.] My 
crude strategy was to let the number of iterations at each temperature, /, be 
a few hundred times the number of spins N, and to discard the first Vz of 
those iterations. With N= 100, I found I needed more than 100 000 iterations 
to reach equilibrium at any given temperature. 

Results for small N with J — 1 . 

I simulated an I x I grid for I = 4, 5, . . . , 10, 40, 64. Let's have a quick think 
about what results we expect. At low temperatures the system is expected 
to be in a ground state. The rectangular Ising model with J = 1 has two 
ground states, the all +1 state and the all —1 state. The energy per spin of 
either ground state is —2. At high temperatures, the spins are independent, 
all states are equally probable, and the energy is expected to fluctuate around 
a mean of with a standard deviation proportional to 1/ y/N. 

Let's look at some results. In all figures temperature T is shown with 
fee = 1. The basic picture emerges with as few as 16 spins (figure 31.3, 
top): the energy rises monotonically. As we increase the number of spins to 
100 (figure 31.3, bottom) some new details emerge. First, as expected, the 
fluctuations at large temperature decrease as 1/ y/~N. Second, the fluctuations 
at intermediate temperature become relatively bigger. This is the signature 
of a 'collective phenomenon', in this case, a phase transition. Only systems 
with infinite N show true phase transitions, but with N = 100 we are getting 
a hint of the critical fluctuations. Figure 31.5 shows details of the graphs for 
N = 100 and N = 4096. Figure 31.2 shows a sequence of typical states from 
the simulation of N = 4096 spins at a sequence of decreasing temperatures. 
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Figure 31.2. Sample states of 
rectangular Ising models with 
J = 1 at a sequence of 
temperatures T. 
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N 



Mean energy and fluctuations 



Mean square magnetization 



16 



100 





Figure 31.3. Monte Carlo 
simulations of rectangular Ising 
models with .7=1. Mean energy 
and fluctuations in energy as a 
function of temperature (left). 
Mean square magnetization as a 
function of temperature (right). 
In the top row, TV = 16, and the 
bottom, N = 100. For even larger 
N, see later figures. 



Contrast with Schottky anomaly 

A peak in the heat capacity, as a function of temperature, occurs in any system 
that has a finite number of energy levels; a peak is not in itself evidence of a 
phase transition. Such peaks were viewed as anomalies in classical thermody- 
namics, since 'normal' systems with infinite numbers of energy levels (such as 
a particle in a box) have heat capacities that are either constant or increasing 
functions of temperature. In contrast, systems with a finite number of levels 
produced small blips in the heat capacity graph (figure 31.4). 

Let us refresh our memory of the simplest such system, a two- level system 
with states x = (energy 0) and x = 1 (energy e). The mean energy is 



exp(-/3e) 



1 



T + exp(-/3e) 
and the derivative with respect to (3 is 



exp(/3e) 



(31.22) 



dE/d/3 = -e 



2 ex P(/5 £ ) 



exp(/3e)p 



So the heat capacity is 

C = dE/dT = 



dE 1 e 2 exp(/3e) 

'dj3k^~ k B T 2 [l+exp(/3e)p 



(31.23) 
(31.24) 

and the fluctuations in energy are given by vai(E) = Ck B T 2 = —dE/df3, 
which was evaluated in (31.23). The heat capacity and fluctuations are plotted 
in figure 31.6. The take-home message at this point is that whilst Schottky 
anomalies do have a peak in the heat capacity, there is no peak in their 
fluctuations; the variance of the energy simply increases monotonically with 
temperature to a value proportional to the number of independent spins. Thus 
it is a peak in the fluctuations that is interesting, rather than a peak in the 
heat capacity. The Ising model has such a peak in its fluctuations, as can be 
seen in the second row of figure 31.5. 



Figure 31.4. Schematic diagram to 
explain the meaning of a Schottky 
anomaly. The curve shows the 
heat capacity of two gases as a 
function of temperature. The 
lower curve shows a normal gas 
whose heat capacity is an 
increasing function of 
temperature. The upper curve has 
a small peak in the heat capacity, 
which is known as a Schottky 
anomaly (at least in Cambridge). 
The peak is produced by the gas 
having magnetic degrees of 
freedom with a finite number of 
accessible states. 



Rectangular Ising model with J — — 1 

What do we expect to happen in the case J = — 1? The ground states of an 
infinite system are the two checkerboard patterns (figure 31.7), and they have 
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N = 100 N = 4096 Figure 31.5. Detail of Monte Carlo 




simulations of rectangular Ising 
models with J = 1. (a) Mean 
energy and fluctuations in energy 
as a function of temperature, (b) 
Fluctuations in energy (standard 
deviation), (c) Mean square 
magnetization, (d) Heat capacity. 




Figure 31.6. Schottky anomaly - 
Heat capacity and fluctuations in 
energy as a function of 
temperature for a two-level system 
with separation e = 1 and fee = 1. 



0.1 1 10 

Temperature 
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energy per spin —2, like the ground states of the J = 1 model. Can this analogy 
be pressed further?A moment's reflection will confirm that the two systems 
are equivalent to each other under a checkerboard symmetry operation. If you 
take an infinite J = 1 system in some state and flip all the spins that lie on 
the black squares of an infinite checkerboard, and set J = — 1 (figure 31.8), 
then the energy is unchanged. (The magnetization changes, of course.) So all 
thermodynamic properties of the two systems are expected to be identical in 
the case of zero applied field. 

But there is a subtlety lurking here. Have you spotted it? We are simu- 
lating finite grids with periodic boundary conditions. If the size of the grid in 
any direction is odd, then the checkerboard operation is no longer a symme- 
try operation relating J = +1 to J = —1, because the checkerboard doesn't 
match up at the boundaries. This means that for systems of odd size, the 
ground state of a system with J = — 1 will have degeneracy greater than 2, 
and the energy of those ground states will not be as low as —2 per spin. So we 
expect qualitative differences between the cases J = ±1 in odd-sized systems. 
These differences are expected to be most prominent for small systems. The 
frustrations are introduced by the boundaries, and the length of the boundary 
grows as the square root of the system size, so the fractional influence of this 
boundary-related frustration on the energy and entropy of the system will de- 
crease as 1/y/N. Figure 31.9 compares the energies of the ferromagnetic and 
antiferromagnetic models with N = 25. Here, the difference is striking. 





Figure 31.7. The two ground 
states of a rectangular Ising model 
with J = — 1. 



J = -1 



J = +l 




Figure 31.8. Two states of 

rectangular Ising models with 

J = ±1 that have identical energy. 




Figure 31.9. Monte Carlo 
simulations of rectangular Ising 
models with J = ±1 and N = 25. 
Mean energy and fluctuations in 
energy as a function of 
temperature, (a) J = 1. (b) 
J = -1. 



Temperature 



Temperature 



Triangular Ising model 

We can repeat these computations for a triangular Ising model. Do we expect 
the triangular Ising model with J = ±1 to show different physical properties 
from the rectangular Ising model? Presumably the J = 1 model will have 
broadly similar properties to its rectangular counterpart. But the case J = — 1 
is radically different from what's gone before. Think about it: there is no 
unfrustrated ground state; in any state, there must be frustrations - pairs of 
neighbours who have the same sign as each other. Unlike the case of the 
rectangular model with odd size, the frustrations are not introduced by the 
periodic boundary conditions. Every set of three mutually neighbouring spins 
must be in a state of frustration, as shown in figure 31.10. (Solid lines show 
'happy' couplings which contribute — |J| to the energy; dashed lines show 
'unhappy' couplings which contribute |J|.) Thus we certainly expect different 
behaviour at low temperatures. In fact we might expect this system to have 
a non-zero entropy at absolute zero. ('Triangular model violates third law of 
thermodynamics! ') 

Let's look at some results. Sample states are shown in figure 31.12, and 
figure 31.11 shows the energy, fluctuations, and heat capacity for N = 4096. 




(b) 



Figure 31.10. In an 

antiferromagnetic triangular Ising 
model, any three neighbouring 
spins are frustrated. Of the eight 
possible configurations of three 
spins, six have energy —\J\ (a), 
and two have energy 3| J| (b). 
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31.2: Direct computation of partition function of Ising models 407 

Note how different the results for J = ±1 are. There is no peak at all in 
the standard deviation of the energy in the case J = — 1. This indicates that 
the antiferromagnetic system does not have a phase transition to a state with 
long-range order. 



J = +l J=-\ 




Figure 31.11. Monte Carlo 
simulations of triangular Ising 
models with J = ±1 and 
iV = 4096. (a-c) J = 1. (d-f) 
J = — 1. (a, d) Mean energy and 
fluctuations in energy as a 
function of temperature, (b, e) 
Fluctuations in energy (standard 
deviation), (c, f) Heat capacity. 




► 31.2 Direct computation of partition function of Ising models 

We now examine a completely different approach to Ising models. The trans- 
fer matrix method is an exact and abstract approach that obtains physical 
properties of the model from the partition function 

Z(/3, J, b) = exp[-/3£(x; J, b)] , (31.25) 

X 

where the summation is over all states x, and the inverse temperature is 
/3 = 1/T. [As usual, Let fee = 1-] The free energy is given by F = — jjlnZ. 
The number of states is 2^, so direct computation of the partition function 
is not possible for large N. To avoid enumerating all global states explicitly, 
we can use a trick similar to the sum-product algorithm discussed in Chapter 
25. We concentrate on models that have the form of a long thin strip of width 
W with periodic boundary conditions in both directions, and we iterate along 
the length of our model, working out a set of partial partition functions at one 
location I in terms of partial partition functions at the previous location I — 1. 
Each iteration involves a summation over all the states at the boundary. This 
operation is exponential in the width of the strip, W. The final clever trick 
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409 



is to note that if the system is translation-invariant along its length then we 
only need to do one iteration in order to find the properties of a system of any 
length. 

The computational task becomes the evaluation of an S x S matrix, where 
S is the number of microstates that need to be considered at the boundary, 
and the computation of its eigenvalues. The eigenvalue of largest magnitude 
gives the partition function for an infinite-length thin strip. 

Here is a more detailed explanation. Label the states of the C columns of 
the thin strip s\,S2, ■ ■ ■ , sq, with each s an integer from to 2 W — 1. The rth 
bit of s c indicates whether the spin in row r, column c is up or down. The 
partition function is 



5>xp(-/3£(x)) 



(31.26) 



= EE-E ex P (-/?5>(*c,*<h-i) , (31.27) 

SI S 2 8 C \ C=l / 

where £(s c ,s c+ i) is an appropriately defined energy, and, if we want periodic 
boundary conditions, sc+i is defined to be s\. One definition for £ is: 

£( s c: s c+l) = ^ ^ J x m%n ~\~ 4 ^ ] J x m%n ~\~ 4 ^ ] J x m x n- (31.28) 

(m,n)eN: (m,n)eM: (m,n)eAf: 



m£c+l,nEc+l 



This definition of the energy has the nice property that (for the rectangular 
Ising model) it defines a matrix that is symmetric in its two indices s c , s c+i- 
The factors of 1/4 are needed because vertical links are counted four times. 
Let us define 

M ss , = exp(-/?£(s,s')) . (31.29) 
Then continuing from equation (31.27), 



sc Lc=l 



II M -, 



Sc+l 



Trace [M c ] 

EX' 



(31.30) 

(31.31) 
(31.32) 




S2 s 3 



Figure 31.13. Illustration to help 
explain the definition (31.28). 
£{s2, S3) counts all the 
contributions to the energy in the 
rectangle. The total energy is 
given by stepping the rectangle 
along. Each horizontal bond 
inside the rectangle is counted 
once; each vertical bond is 
half-inside the rectangle (and will 
be half-inside an adjacent 
rectangle) so half its energy is 
included in £(s2, S3); the factor of 
1/4 appears in the second term 
because m and n both run over all 
nodes in column c, so each bond is 
visited twice. 

For the state shown here, 
s 2 = (100)2, S3 = (H0)2, the 
horizontal bonds contribute + J to 
£(s 2 , S3), and the vertical bonds 
contribute — J/2 on the left and 
— J/2 on the right, assuming 
periodic boundary conditions 
between top and bottom. So 

£(S2,S 3 ) = 0. 



where {^a}a=i are t ne eigenvalues of M. As the length of the strip C increases, 
Z becomes dominated by the largest eigenvalue /i max : 



Z -> /4 ax . (31.33) 

So the free energy per spin in the limit of an infinite thin strip is given by: 

/ = -kT\nZ/{WC) = -kTC\nu. max /{WC) = -fcTln Mmax /H/. (31.34) 

It's really neat that all the thermodynamic properties of a long thin strip can 
be obtained from just the largest eigenvalue of this matrix M! 



Computations 

I computed the partition functions of long-thin-strip Ising models with the 
geometries shown in figure 31.14. 

As in the last section, I set the applied field H to zero and considered the 
two cases J = ±1 which are a ferromagnet and antiferromagnet respectively. I 
computed the free energy per spin, /(/?, J, H) = F/N for widths from W = 2 
to 8 as a function of (3 for H = 0. 
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Rectangular: 



I I I I I ^ I ^ ^ I I 



W 



w 



Triangular: 



I I I I 



I I I I 



Computational ideas: 

Only the largest eigenvalue is needed. There are several ways of getting this 
quantity, for example, iterative multiplication of the matrix by an initial vec- 
tor. Because the matrix is all positive we know that the principal eigenvector 
is all positive too (Frobenius-Perron theorem) , so a reasonable initial vector is 
(1,1,... ,1). This iterative procedure may be faster than explicit computation 
of all eigenvalues. I computed them all anyway, which has the advantage that 
we can find the free energy of finite length strips - using equation (31.32) - as 
well as infinite ones. 



Ferromagnets of width 8 



Antiferromagnets of width 8 




Figure 31.14. Two long-thin-strip 
Ising models. A line between two 
spins indicates that they are 
neighbours. The strips have width 
W and infinite length. 



Figure 31.15. Free energy per spin 
of long-thin-strip Ising models. 
Note the non-zero gradient at 
T = in the case of the triangular 
ant ifcrromagnct . 



Temperature 



Comments on graphs: 

For large temperatures all Ising models should show the same behaviour: the 
free energy is entropy-dominated, and the entropy per spin is ln(2). The mean 
energy per spin goes to zero. The free energy per spin should tend to — ln(2)/ /3. 
The free energies are shown in figure 31.15. 

One of the interesting properties we can obtain from the free energy is 
the degeneracy of the ground state. As the temperature goes to zero, the 
Boltzmann distribution becomes concentrated in the ground state. If the 
ground state is degenerate (i.e., there are multiple ground states with identical 



0.7 
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0.3 




0.2 
0.1 


Triangular(-) 

Rectangular 

Triangular(-t-) 


n 





2 4 6 8 10 

Temperature 



Figure 31.16. Entropies (in nats) 
of width 8 Ising systems as a 
function of temperature, obtained 
by differentiating the free energy 
curves in figure 31.15. The 
rectangular fcrromagnct and 
antifcrromagnet have identical 
thermal properties. For the 
triangular systems, the upper 
curve (— ) denotes the 
antifcrromagnet and the lower- 
curve (+) the ferromagnct. 
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Figure 31.17. Mean energy versus 
temperature of long thin strip 
Ising models with width 8. 
Compare with figure 31.3. 



Rectangular Ferromagnet 




Triangular Ising Models 
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Figure 31.18. Heat capacities of 
(a) rectangular model; (b) 
triangular models with different 
widths, (+) and (— ) denoting 
ferromagnet and antiferromagnet. 
Compare with figure 31.11. 



Temperature 



Temperature 



energy) then the entropy as T — > is non-zero. We can find the entropy from 
the free energy using S = —BF/dT. 

The entropy of the triangular antiferromagnet at absolute zero appears to 
be about 0.3, that is, about half its high temperature value (figure 31.16). 
The mean energy as a function of temperature is plotted in figure 31.17. It is 
evaluated using the identity (E) = — dlnZ/d/3. 

Figure 31.18 shows the estimated heat capacity (taking raw derivatives of 
the mean energy) as a function of temperature for the triangular models with 
widths 4 and 8. Figure 31.19 shows the fluctuations in energy as a function of 
temperature. All of these figures should show smooth graphs; the roughness of 
the curves is due to inaccurate numerics. The nature of any phase transition 
is not obvious, but the graphs seem compatible with the assertion that the 
ferromagnet shows, and the antiferromagnet does not show a phase transition. 

The pictures of the free energy in figure 31.15 give some insight into how 
we could predict the transition temperature. We can see how the two phases 
of the ferromagnetic systems each have simple free energies: a straight sloping 
line through F = 0, T = for the high temperature phase, and a horizontal 
line for the low temperature phase. (The slope of each line shows what the 
entropy per spin of that phase is.) The phase transition occurs roughly at 
the intersection of these lines. So we predict the transition temperature to be 
linearly related to the ground state energy. 

Figure 31.19. Energy variances, 
per spin, of (a) rectangular model; 
(b) triangular models with 
different widths, (+) and (— ) 
denoting ferromagnet and 
antiferromagnet. Compare with 
figure 31.11. 



Rectangular Ferromagnet Triangular Ising Models 




1 10 1 10 

Temperature Temperature 
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Comparison with the Monte Carlo results 

The agreement between the results of the two experiments seems very good. 
The two systems simulated (the long thin strip and the periodic square) are 
not quite identical. One could a more accurate comparison by finding all 
eigenvalues for the strip of width W and computing X w to get the partition 
function of a W x W patch. 



► 31.3 Exercises 

> Exercise 31.2. ^ What would be the best way to extract the entropy from the 
Monte Carlo simulations? What would be the best way to obtain the 
entropy and the heat capacity from the partition function computation? 

Exercise 31.3.^ An Ising model may be generalized to have a coupling J mn 
«S-- between any spins m and n, and the value of J mn could be different for each 
m and n. In the special case where all the couplings are positive we know 
that the system has two ground states, the all-up and all-down states. For a 
more general setting of J mn it is conceivable that there could be many ground 
states. 

Imagine that it is required to make a spin system whose local minima are 
a given list of states X( 1 ),X( 2 ), • • • , X (S)- Can you think of a way of setting J 
such that the chosen states are low energy states? You are allowed to adjust 
all the { Jmn} to whatever values you wish. 
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32 



Exact Monte Carlo Sampling 



► 32.1 The problem with Monte Carlo methods 

For high-dimensional problems, the most widely used random sampling meth- 
ods are Markov chain Monte Carlo methods like the Metropolis method, Gibbs 
sampling, and slice sampling. 

The problem with all these methods is this: yes, a given algorithm can be 
guaranteed to produce samples from the target density -P(x) asymptotically, 
'once the chain has converged to the equilibrium distribution'. But if one runs 
the chain for too short a time T, then the samples will come from some other 
distribution p( T ) (x). For how long must the Markov chain be run before it has 
'converged'? As was mentioned in Chapter 29, this question is usually very 
hard to answer. However, the pioneering work of Propp and Wilson (1996) 
allows one, for certain chains, to answer this very question; furthermore Propp 
and Wilson show how to obtain 'exact' samples from the target density. 

► 32.2 Exact sampling concepts 

Propp and Wilson's exact sampling method (also known as 'perfect simulation' 
or 'coupling from the past') depends on three ideas. 

Coalescence of coupled Markov chains 

First, if several Markov chains starting from different initial conditions share 
a single random-number generator, then their trajectories in state space may 
coalesce; and, having, coalesced, will not separate again. If all initial condi- 
tions lead to trajectories that coalesce into a single trajectory, then we can be 
sure that the Markov chain has 'forgotten' its initial condition. Figure 32.1a-i 
shows twenty-one Markov chains identical to the one described in section 29.4, 
which samples from {0, 1, . . . , 20} using the Metropolis algorithm (figure 29.12, 
p. 368); each of the chains has a different initial condition but they are all driven 
by a single random number generator; the chains coalesce after about 80 steps. 
Figure 32.1a-ii shows the same Markov chains with a different random number 
seed; in this case, coalescence does not occur until 400 steps have elapsed (not 
shown). Figure 32.1b shows similar Markov chains, each of which has identical 
proposal density to those in section 29.4 and figure 32.1a; but in figure 32.1b, 
the proposed move at each step, 'left' or 'right', is obtained in the same way by 
all the chains at any timestep, independent of the current state. This coupling 
of the chains changes the statistics of coalescence. Because two neighbouring 
paths only merge when a rejection occurs, and rejections only occur at the 
walls (for this particular Markov chain), coalescence will occur only when the 
chains are all in the leftmost state or all in the rightmost state. 
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Figure 32.1. Coalescence, the first 
idea behind the exact sampling 
method. Time runs from bottom 
to top. In the leftmost panel, 
coalescence occurred within 100 
steps. Different coalescence 
properties arc obtained depending 
on the way each state uses the 
random numbers it is supplied 
with, (a) Two runs of a 
Metropolis simulator in which the 
random bits that determine the 
proposed step depend on the 
current state; a different random 
number seed was used in each 
case, (b) In this simulator the 
random proposal ('left' or 'right') 
is the same for all states. In each 
panel, one of the paths, the one 
starting at location x = 8, has 
been highlighted. 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 

32.2: Exact sampling concepts 
Coupling from the past 

How can we use the coalescence property to find an exact sample from the 
equilibrium distribution of the chain? The state of the system at the moment 
when complete coalescence occurs is not a valid sample from the equilibrium 
distribution; for example in figure 32.1b, final coalescence always occurs when 
the state is against one of the two walls, because trajectories only merge at 
the walls. So sampling forward in time until coalescence occurs is not a valid 
method. 

The second key idea of exact sampling is that we can obtain exact samples 
by sampling from a time To in the past, up to the present. If coalescence 
has occurred, the present sample is an unbiased sample from the equilibrium 
distribution; if not, we restart the simulation from a time Tq further into 
the past, reusing the same random numbers. The simulation is repeated at a 
sequence of ever more distant times To, with a doubling of To from one run to 
the next being a convenient choice. When coalescence occurs at a time before 
'the present', we can record x(0) as an exact sample from the equilibrium 
distribution of the Markov chain. 

Figure 32.2 shows two exact samples produced in this way. In the leftmost 
panel of figure 32.2a, we start twenty-one chains in all possible initial condi- 
tions at To = —50 and run them forward in time. Coalescence does not occur. 
We restart the simulation from all possible initial conditions at To = —100, 
and reset the random number generator in such a way that the random num- 
bers generated at each time t (in particular, from t = —50 to t = 0) will be 
identical to what they were in the first run. Notice that the trajectories pro- 
duced from t = —50 to t = by these runs that started from Tq = —100 are 
identical to a subset of the trajectories in the first simulation with To = —50. 
Coalescence still does not occur, so we double To again to To = —200. This 
time, all the trajectories coalesce and we obtain an exact sample, shown by 
the arrow. If we pick an earlier time such as Tq = —500, all the trajectories 
must still end in the same point at t = 0, since every trajectory must pass 
through some state at t = —200, and all those states lead to the same final 
point. So if we ran the Markov chain for an infinite time in the past, from any 
initial condition, it would end in the same state. Figure 32.2b shows an exact 
sample produced in the same way with the Markov chains of figure 32.1b. 

This method, called coupling from the past, is important because it allows 
us to obtain exact samples from the equilibrium distribution; but, as described 
here, it is of little practical use, since we are obliged to simulate chains starting 
in all initial states. In the examples shown, there are only twenty-one states, 
but in any realistic sampling problem there will be an utterly enormous number 
of states - think of the 2 1000 states of a system of 1000 binary spins, for 
example. The whole point of introducing Monte Carlo methods was to try to 
avoid having to visit all the states of such a system! 

Monotonicity 

Having established that we can obtain valid samples by simulating forward 
from times in the past, starting in all possible states at those times, the third 
trick of Propp and Wilson, which makes the exact sampling method useful 
in practice, is the idea that, for some Markov chains, it may be possible to 
detect coalescence of all trajectories without simulating all those trajectories. 
This property holds, for example, in the chain of figure 32.1b, which has the 
property that two trajectories never cross. So if we simply track the two tra- 
jectories starting from the leftmost and rightmost states, we will know that 
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Figure 32.3. (a) Ordering of states, the third idea behind the exact sampling method. The trajectories 
shown here are the left-most and right-most trajectories of figure 32.2b. In order to establish 
what the state at time zero is, we only need to run simulations from T = —50, T = —100, 
and To = —200, after which point coalescence occurs. 

(b,c) Two more exact samples from the target density, generated by this method, and 
different random number seeds. The initial times required were T = —50 and T = —1000, 
respectively. 
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coalescence of all trajectories has occurred when those two trajectories co- 
alesce. Figure 32.3a illustrates this idea by showing only the left-most and 
right-most trajectories of figure 32.2b. Figure 32.3(b,c) shows two more ex- 
act samples from the same equilibrium distribution generated by running the 
'coupling from the past' method starting from the two end-states alone. In 
(b), two runs coalesced starting from To = —50; in (c), it was necessary to try 
times up to To = —1000 to achieve coalescence. 

► 32.3 Exact sampling from interesting distributions 

In the toy problem we studied, the states could be put in a one-dimensional 
order such that no two trajectories crossed. The states of many interesting 
state spaces can also be put into a partial order and coupled Markov chains 
can be found that respect this partial order. [An example of a partial order 
on the four possible states of two spins is this: (+,+) > (+, — ) > ( — , — ); 
and (+,+) > (— ,+) > ( — ,—); and the states (+, — ) and (— ,+) are not 
ordered.] For such systems, we can show that coalescence has occurred merely 
by verifying that coalescence has occurred for all the histories whose initial 
states were 'maximal' and 'minimal' states of the state space. 

As an example, consider the Gibbs sampling method applied to a ferro- 
magnetic Ising spin system, with the partial ordering of states being defined 
thus: state x is 'greater than or equal to' state y if Xi > Hi for all spins i. 
The maximal and minimal states are the the all-up and all-down states. The 
Markov chains are coupled together as shown in algorithm 32.4. Propp and 
Wilson (1996) show that exact samples can be generated for this system, al- 
though the time to find exact samples is large if the Ising model is below its 
critical temperature, since the Gibbs sampling method itself is slowly-mixing 
under these conditions. Propp and Wilson have improved on this method for 
the Ising model by using a Markov chain called the single-bond heat bath 
algorithm to sample from a related model called the random cluster model; 
they show that exact samples from the random cluster model can be obtained 
rapidly and can be converted into exact samples from the Ising model. Their 
ground-breaking paper includes an exact sample from a 16-million-spin Ising 
model at its critical temperature. A sample for a smaller Ising model is shown 
in figure 32.5. 

A generalization of the exact sampling method for 'non-attractive ' distri- 
butions 

The method of Propp and Wilson for the Ising model, sketched above, can 
only be applied to probability distributions that are, as they call them, 'at- 
tractive'. Rather than define this term, let's say what it means, for practical 
purposes: the method can be applied to spin systems in which all the cou- 
plings are positive (e.g., the ferromagnet), and to a few special spin systems 
with negative couplings (e.g., as we already observed in Chapter 31, the rect- 
angular ferromagnet and antiferromagnet are equivalent); but it cannot be 
applied to general spin systems in which some couplings are negative, because 
in such systems the trajectories followed by the all-up and all-down states 
are not guaranteed to be upper and lower bounds for the set of all trajecto- 
ries. Fortunately, however, we do not need to be so strict. It is possible to 
re-express the Propp and Wilson algorithm in a way that generalizes to the 
case of spin systems with negative couplings. The idea of the summary state 
version of exact sampling is still that we keep track of bounds on the set of 



Compute ai := ^ ■ JijXj 
Draw u from Uniform(0, 1) 
If u< l/(l + e- 2a ») 

Xi := +1 
Else 

Xi := -1 



Algorithm 32.4. Gibbs sampling 
coupling method. The Markov 
chains arc coupled together by 
having all chains update the same 
spin i at each time step and 
having all chains share a common 
sequence of random numbers u. 




Figure 32.5. An exact sample from 
the Ising model at its critical 
temperature, produced by 
D.B. Wilson. Such samples can be 
produced within seconds on an 
ordinary computer by exact 
sampling. 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 



32.3: Exact sampling from interesting distributions 



419 



all trajectories, and detect when these bounds are equal, so as to find exact 
samples. But the bounds will not themselves be actual trajectories, and they 
will not necessarily be tight bounds. 

Instead of simulating two trajectories, each of which moves in a state space 
{ — 1, +1}^, we simulate one trajectory envelope in an augmented state space 
{ — 1, +1, 7} N , where the symbol ? denotes 'either —1 or +1'. We call the state 
of this augmented system the 'summary state'. An example summary state of 
a six-spin system is ++-?+?. This summary state is shorthand for the set of 
states 



The update rule at each step of the Markov chain takes a single spin, enu- 
merates all possible states of the neighbouring spins that are compatible with 
the current summary state, and, for each of these local scenarios, computes 
the new value (+ or -) of the spin using Gibbs sampling (coupled to a random 
number u as in algorithm 32.4). If all these new values agree, then the new 
value of the updated spin in the summary state is set to the unanimous value 
(+ or -). Otherwise, the new value of the spin in the summary state is '?'. The 
initial condition, at time Tq, is given by setting all the spins in the summary 
state to '?', which corresponds to considering all possible start configurations. 

In the case of a spin system with positive couplings, this summary state 
simulation will be identical to the simulation of the uppermost state and low- 
ermost states, in the style of Propp and Wilson, with coalescence occuring 
when all the '?' symbols have disappeared. The summary state method can 
be applied to general spin systems with any couplings. The only shortcoming 
of this method is that the envelope may describe an unnecessarily large set of 
states, so there is no guarantee that the summary state algorithm will con- 
verge; the time for coalescence to be detected may be considerably larger than 
the actual time taken for the underlying Markov chain to coalesce. 

The summary state scheme has been applied to exact sampling in belief 
networks by Harvey and Neal (2000), and to the triangular antiferromagnetic 
Ising model by Childs et al. (2001). Summary state methods were first intro- 
duced by Huber (1998); they also go by the names sandwiching methods and 
bounding chains. 

Further reading 

For further reading, impressive pictures of exact samples from other distribu- 
tions, and generalizations of the exact sampling method, browse the perfectly- 
random sampling website. 1 

For beautiful exact-sampling demonstrations running live in your web- 
browser, see Jim Propp's website. 2 

Other uses for coupling 

The idea of coupling together Markov chains by having them share a random 
number generator has other applications beyond exact sampling. Pinto and 
Neal (2001) have shown that the accuracy of estimates obtained from a Markov 
chain Monte Carlo simulation (the second problem discussed in section 29.1, 
p. 357), using the estimator 



++-+++, ++-++-, ++ — ++, ++ — +- . 




(32.1) 



t 



1 http : //www. dbwilson. com/exact/ 

2 http : //www. math, wise . edu/~propp/tiling/www/applets/ 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 



420 




32 — Exact Monte Carlo Sampling 

Figure 32.6. A perfectly random 
tiling of a hexagon by lozenges, 
provided by J.G. Propp and 
D.B. Wilson. 



can be improved by coupling the chain of interest, which converges to P, to a 
second chain, which generates samples from a second, simpler distribution, Q. 
The coupling must be set up in such a way that the states of the two chains 
are strongly correlated. The idea is that we first estimate the expectations of 
a function of interest, 0, under P and under Q in the normal way (32.1) and 
compare the estimate under Q, <£>q, with the true value of the expectation 
under Q, <£>q which we assume can be evaluated exactly. If $q is an overes- 
timate then it is likely that 4>p will be an overestimate too. The difference 
(<3E>q — $q) can thus be used to correct <3?p. 



► 32.4 Exercises 

> Exercise 32.1.^' p ' 421 l Is there any relationship between the probability dis- 

tribution of the time taken for all trajectories to coalesce, and the equi- 
libration time of a Markov chain? Prove that there is a relationship, or 
find a single chain that can be realized in two different ways that have 
different coalescence times. 

> Exercise 32.2.^ Imagine that Fred ignores the requirement that the random 

bits used at some time t, in every run from increasingly distant times 
To, must be identical, and makes a coupled-Markov-chain simulator that 
uses fresh random numbers every time To is changed. Describe what 
happens if Fred applies his method to the Markov chain that is intended 
to sample from the uniform distribution over the states 0, 1, and 2, using 
the Metropolis method, driven by a random bit source as in figure 32.1b. 

Exercise 32.3. ^ Investigate the application of perfect sampling to linear re- 
gression in Holmes and Mallick (1998) or Holmes and Denison (2002) 
and try to generalize it. 

Exercise 32.4. ^ The concept of coalescence has many applications. Some sur- 
names are more frequent than others, and some die out altogether. Make 
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32.5: Solutions 

a model of this process; how long will it take until everyone has the same 
surname? 

Similarly, variability in any particular portion of the human genome 
(which forms the basis of forensic DNA fingerprinting) is inherited like a 
surname. A DNA fingerprint is like a string of surnames. Should the fact 
that these surnames are subject to coalescences, so that some surnames 
are by chance more prevalent than others, affect the way in which DNA 
fingerprint evidence is used in court? 

> Exercise 32.5.^ How can you use a coin to create a random ranking of 3 
people? Construct a solution that uses exact sampling. For example, 
you could apply exact sampling to a Markov chain in which the coin is 
repeatedly used alternately to decide whether to switch first and second, 
then whether to switch second and third. 

Exercise 32.6. ^ Finding the partition function Z of a probability distribution 
is a difficult problem. Many Markov chain Monte Carlo methods produce 
valid samples from a distribution without ever finding out what Z is. 

Is there any probability distribution and Markov chain such that either 
the time taken to produce a perfect sample or the number of random bits 
used to create a perfect sample are related to the value of Z? Are there 
some situations in which the time to coalescence conveys information 
about Zl 

► 32.5 Solutions 

Solution to exercise 32.1 (p. 420). It is perhaps surprising that there is no di- 
rect relationship between the equilibration time and the time to coalescence. 
We can prove this using the example of the uniform distribution over the inte- 
gers A = {0, 1,2,..., 20}. A Markov chain that converges to this distribution 
in exactly one iteration is the chain for which the probability of state St+i 
given st is the uniform distribution, for all Sf. Such a chain can be coupled 
to a random number generator in two ways: (a) we could draw a random 
integer u € A, and set s t+ \ equal to u regardless of s t ; or (b) we could draw 
a random integer u € A, and set st+i equal to (s t + u) mod 21. Method (b) 
would produce a cohort of trajectories locked together, similar to the trajec- 
tories in figure 32.1, except that no coalescence ever occurs. Thus, while the 
equilibration times of methods (a) and (b) are both one, the coalescence times 
are respectively one and infinity. 

It seems plausible on the other hand that coalescence time provides some 
sort of upper bound on equilibration time. 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 



33 



Variational Methods 



Variational methods are an important technique for the approximation of com- 
plicated probability distributions, having applications in statistical physics, 
data modelling and neural networks. 

► 33.1 Variational free energy minimization 

One method for approximating a complex distribution in a physical system is 
mean field theory. Mean field theory is a special case of a general variational 
free energy approach of Feynman and Bogoliubov which we will now study. 
The key piece of mathematics needed to understand this method is Gibbs' 

inequality, which we repeat here. Gibbs' inequality first appeared in 

equation (1.24); see also 

The relative entropy between two probability distributions Q{x) and P{x) exercise 2.26 (p. 37). 
that are defined over the same alphabet Ax is 



The relative entropy satisfies -Dkl(QH-P) > (Gibbs' inequality) with 
equality only if Q = P. In general D KL (Q\\P) ^ D KL {P\\Q). 

In this chapter we will replace the log by In, and measure the divergence 
in nats. 

Probability distributions in statistical physics 

In statistical physics one often encounters probability distributions of the form 




X 



(33.1) 





(33.3) 



The partition function (normalizing constant) is 




(33.4) 



X 



The probability distribution of equation (33.2) is complex. Not unbearably 
complex - we can, after all, evaluate i?(x; J) for any particular x in a time 
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33.1: Variational free energy minimization 



polynomial in the number of spins. But evaluating the normalizing constant 
Z(j3,J) is difficult, as we saw in Chapter 29, and describing the properties of 
the probability distribution is also hard. Knowing the value of E(x; J) at a 
few arbitrary points x, for example, gives no useful information about what 
the average properties of the system are. 

An evaluation of Z(f3, J) would be particularly desirable because from Z 
we can derive all the thermodynamic properties of the system. 

Variational free energy minimization is a method for approximating the 
complex distribution P(x) by a simpler ensemble Q(x; 6) that is parameterized 
by adjustable parameters 0. We adjust these parameters so as to get Q to 
best approximate P, in some sense. A by-product of this approximation is a 
lower bound on Z(/3, J). 



The variational free energy 

The objective function chosen to measure the quality of the approximation is 
the variational free energy 

^ W ^Q(x; g )ln exp[ ^) ;J)] . (33.5) 

This expression can be manipulated into a couple of interesting forms: first, 
0F{8) = /?£Q(x;0)£;(x;J)-£Q(x;0)ln^l^ (33.6) 

= 0{E{ X ;3)) Q -S Q , (33.7) 

where (E(x;J))q is the average of the energy function under the distribution 
Q(x;0), and Sq is the entropy of the distribution Q(x;0) (we set ks to one 
in the definition of S so that it is identical to the definition of the entropy H 
in Part I). 

Second, we can use the definition of P(x | (3, J) to write: 

0F(O) = £ Q(x;0)ln^^-lnZ(/?,J) (33.8) 

= Dkl(Q\\P) + 0F, (33.9) 

where F is the true free energy, defined by 

/3F= -lnZ(/3,J), (33.10) 

and Dkl(Q\\P) is the relative entropy between the approximating distribution 
Q(x;6) and the true distribution P(x|/3,J). Thus by Gibbs' inequality, the 
variational free energy F(0) is bounded below by F and only attains this value 
for Q(x;0) = P(x|/?,J). 

Our strategy is thus to vary 8 in such a way that (3F(6) is minimized. 
The approximating distribution then gives a simplified approximation to the 
true distribution that may be useful, and the value of (3F{0) will be an upper 
bound for (3F. Equivalently, Z = e~P Fi fi) is a lower bound for Z. 

Can the objective function j3F be evaluated? 

We have already agreed that the evaluation of various interesting sums over x 
is intractable. For example, the partition function 



Z = ^exp(-/3£(x;J)), 



(33.11) 
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the energy 



and the entropy 



(B)p4E E & J ) exp(-/3£(x; J)) , (33.12) 



S^P(x|/3,J)ln p(x|AJ) (33.13) 



are all presumed to be impossible to evaluate. So why should we suppose 
that this objective function f3F(0), which is also defined in terms of a sum 
over all x (33.5), should be a convenient quantity to deal with? Well, for a 
range of interesting energy functions, and for sufficiently simple approximating 
distributions, the variational free energy can be efficiently evaluated. 

► 33.2 Variational free energy minimization for spin systems 

An example of a tractable variational free energy is given by the spin system 
whose energy function was given in equation (33.3), which we can approximate 
with a separable approximating distribution, 

^ X ' a ^ = ^j) 6XP (^2 anX "j • (33.14) 

The variational parameters 8 of the variational free energy (33.5) are the 
components of the vector a. To evaluate the variational free energy we need 
the entropy of this distribution, 

and the mean of the energy, 

(£(x; J)) Q = £ Q(x; a)£(x; J). (33.16) 

X 

The entropy of the separable approximating distribution is simply the sum of 
the entropies of the individual spins (exercise 4.2, p. 68), 

SQ = E^2 e) (</«), (33.17) 

n 

where q n is the probability that spin n is +1, 



1 



e a„ _|_ e o„ i _|_ exp(— 2a n ) ' 

and 



(33.18) 



H i 2 e) {q) = q\n- + {l-q)ln 1 (33.19) 



(1-9)' 



XrnXr, is a sum 



The mean energy under Q is easy to obtain because ~^2 m _ J„ 
of terms each involving the product of two independent random variables. 
(There are no self-couplings, so J mn = when m = n.) If we define the mean 
value of x n to be x n , which is given by 

— = tanh(o n ) = 2q n - 1, (33.20) 



gflri _|_ g CLn 
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we obtain 



<£(x;J)) Q = £ Q(x;a) 



(33.21) 
(33.22) 



m,n 



So the variational free energy is given by 



/3F{a) = 13 (£(x; 3)) Q -S q = p ( - ^ J mn x m x n - ^ /i„x„ J i/< e) (q n ) . 

\ m.n n / n 

(33.23) 

We now consider minimizing this function with respect to the variational 
parameters a. If q = 1/(1 + e~ 2a ), the derivative of the entropy is 



£-HZ(q)= In l —S- = -2a. 
dq q 



(33.24) 



So we obtain 




0F(a) = p 



= 2 



2 ^gw\ _ /^l - q m \ ( dq n 

dan 



dqn 
da„ 



P I ^ ] JmnXn ~f ^» 



(33.25) 



This derivative is equal to zero when 



&m — P I ^ ] Jmn x n ~\~ h n 
V » 



(33.26) 



So -F(a) is extremized at any point that satisfies equation (33.26) and 

x n = tanh(a n ). 



(33.27) 



The variational free energy F{&) may be a multimodal function, in which 
case each stationary point (maximum, minimum or saddle) will satisfy equa- 
tions (33.26) and (33.27). One way of using these equations, in the case of a 
system with an arbitrary coupling matrix J, is to update each parameter a m 
and the corresponding value of x m using equation (33.26), one at a time. This 
asynchronous updating of the parameters is guaranteed to decrease (3F(a). 

Equations (33.26) and (33.27) may be recognized as the mean field equa- 
tions for a spin system. The variational parameter a n may be thought of as 
the strength of a fictitious field applied to an isolated spin n. Equation (33.27) 
describes the mean response of spin n, and equation (33.26) describes how the 
field a m is set in response to the mean state of all the other spins. 

The variational free energy derivation is a helpful viewpoint for mean field 
theory for two reasons. 

1. This approach associates an objective function (3F with the mean field 
equations; such an objective function is useful because it can help identify 
alternative dynamical systems that minimize the same function. 




Figure 33.1. The variational free 
energy of the two-spin system 
whose energy is -E(x) = — x 1 X2, as 
a function of the two variational 
parameters qi and q 2 . The 
inverse-temperature is (5 = 1.44. 
The function plotted is 

f)F = -Px 1 x 2 -H^(q 1 )-H ( 2 e) (q 2 ), 

where x n = 2q n — 1. Notice that 
for fixed q 2 the function is 
convex ^ with respect to qi , and 
for fixed qi it is convex with 
respect to q 2 . 
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0.5 



-0.5 




h = 0.00 
h = 0.40 
h = 0.80 



Figure 33.2. Solutions of the 
variational free energy 
extremization problem for the 
Ising model, for three different 
applied fields h. Horizontal axis: 
temperature T = 1/(3. Vertical 
axis: magnetization x. The 
critical temperature found by 
mean field theory is T c mft = 4. 



1 



2. The theory is readily generalized to other approximating distributions. 
We can imagine introducing a more complex approximation Q(x; 0) that 
might for example capture correlations among the spins instead of mod- 
elling the spins as independent. One could then evaluate the variational 
free energy and optimize the parameters 6 of this more complex approx- 
imation. The more degrees of freedom the approximating distribution 
has, the tighter the bound on the free energy becomes. However, if the 
complexity of an approximation is increased, the evaluation of either the 
mean energy or the entropy typically becomes more challenging. 



► 33.3 Example: mean field theory for the ferromagnetic Ising model 

In the simple Ising model studied in Chapter 31, every coupling J mn is equal 
to J if m and n are neighbours and zero otherwise. There is an applied 
field h n = h that is the same for all spins. A very simple approximating 
distribution is one with just a single variational parameter a, which defines a 
separable distribution 

Q(x; a) = ±- exp ^ ax}j (33.28) 
in which all spins are independent and have the same probability 

1 + exp(-2o) 

of being up. The mean magnetization is 

x = tanh(o) (33.30) 

and the equation (33.26) which defines the minimum of the variational free 
energy becomes 

a = /3(CJx+h), (33.31) 

where C is the number of couplings that a spin is involved in - C = 4 in the 
case of a rectangular two-dimensional Ising model. We can solve equations 
(33.30) and (33.31) for x numerically - in fact, it is easiest to vary x and solve 
for f3 - and obtain graphs of the free energy minima and maxima as a function 
of temperature as shown in figure 33.2. The solid line shows x versus T = 1/(3 
for the case C = 4, J = 1. 

When h = 0, there is a pitchfork bifurcation at a critical temperature T™ ft . 
[A pitchfork bifurcation is a transition like the one shown by the solid lines in 
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figure 33.2, from a system with one minimum as a function of a (on the right) 
to a system (on the left) with two minima and one maximum; the maximum 
is the middle one of the three lines. The solid lines look like a pitchfork.] 
Above this temperature, there is only one minimum in the variational free 
energy, at a = and x = 0; this minimum corresponds to an approximating 
distribution that is uniform over all states. Below the critical temperature, 
there are two minima corresponding to approximating distributions that are 
symmetry-broken, with all spins more likely to be up, or all spins more likely 
to be down. The state x = persists as a stationary point of the variational 
free energy, but now it is a local maximum of the variational free energy. 

When h > 0, there is a global variational free energy minimum at any 
temperature for a positive value of x, shown by the upper dotted curves in 
figure 33.2. As long as h < JC, there is also a second local minimum in the 
free energy, if the temperature is sufficiently small. This second minimum cor- 
responds to a self-preserving state of magnetization in the opposite direction 
to the applied field. The temperature at which the second minimum appears 
is smaller than T c mft , and when it appears, it is accompanied by a saddle point 
located between the two minima. A name given to this type of bifurcation is 
a saddle-node bifurcation. 

The variational free energy per spin is given by 

(3F = [3(-^Jx 2 -h^j -H { 2 e) 

Exercise 33. 1.^ Sketch the variational free energy as a function of its one 
parameter x for a variety of values of the temperature T and the applied 
field h. 

Figure 33.2 reproduces the key properties of the real Ising system - that, 
for h = 0, there is a critical temperature below which the system has long- 
range order, and that it can adopt one of two macroscopic states. However, 
by probing a little more we can reveal some inadequacies of the variational 
approximation. To start with, the critical temperature T™ ft is 4, which is 
nearly a factor of 2 greater than the true critical temperature T c = 2.27. Also, 
the variational model has equivalent properties in any number of dimensions, 
including d = 1, where the true system does not have a phase transition. So 
the bifurcation at T™ ft should not be described as a phase transition. 

For the case ft = we can follow the trajectory of the global minimum as 
a function of f3 and find the entropy, heat capacity and fluctuations of the ap- 
proximating distribution and compare them with those of a real 8x8 fragment 
using the matrix method of Chapter 31. As shown in figure 33.3, one of the 
biggest differences is in the fluctuations in energy. The real system has large 
fluctuations near the critical temperature, whereas the approximating distri- 
bution has no correlations among its spins and thus has an energy-variance 
which scales simply linearly with the number of spins. 




► 33.4 Variational methods in inference and data modelling 

In statistical data modelling we are interested in the posterior probability 
distribution of a parameter vector w given data D and model assumptions H, 
P(w\D,H). 

WD.H>= ™~™™*K (33.33, 

In traditional approaches to model fitting, a single parameter vector w is op- 
timized to find the mode of this distribution. What is really of interest is 
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Figure 33.3. Comparison of 
approximating distribution's 
properties with those of a real 
8x8 fragment. Notice that the 
variational free energy of the 
approximating distribution is 
indeed an upper bound on the 
free energy of the real system. All 
quantities are shown 'per spin'. 
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33.5: The case of an unknown Gaussian 

the whole distribution. We may also be interested in its normalizing constant 
P(D | Ti) if we wish to do model comparison. The probability distribution 
P(w | D,Ti) is often a complex distribution. In a variational approach to in- 
ference, we introduce an approximating probability distribution over the pa- 
rameters, Q(w; 8), and optimize this distribution (by varying its own param- 
eters 6) so that it approximates the posterior distribution of the parameters 
P(w\D,H) well. 

One objective function we may choose to measure the quality of the ap- 
proximation is the variational free energy 

F W = /d*wQ(w; fl )ln p(J|w y (w|Tf) . (33.34) 

The denominator P(D | w, 7{)P(yv \ Ti) is, within a multiplicative constant, the 
posterior probability P(w\D,H) = P(D\w,H)P(w\H)/P(D\H). So the 
variational free energy F(6) can be viewed as the sum of — mP(D | Ti) and 
the relative entropy between Q(w; 0) and P(w | D, Ti.). F{6) is bounded below 
by - In P(D | Ti) and only attains this value for Q(w; 0) = P(w | D, H). For 
certain models and certain approximating distributions, this free energy, and 
its derivatives with respect to the approximating distribution's parameters, 
can be evaluated. 

The approximation of posterior probability distributions using variational 
free energy minimization provides a useful approach to approximating Bayesian 
inference in a number of fields ranging from neural networks to the decoding of 
error-correcting codes (Hinton and van Camp, 1993; Hinton and Zemel, 1994; 
Dayan et ai, 1995; Neal and Hinton, 1998; MacKay, 1995a). The method 
is sometimes called ensemble learning to contrast it with traditional learning 
processes in which a single parameter vector is optimized. Another name for 
it is variational Bayes. Let us examine how ensemble learning works in the 
simple case of a Gaussian distribution. 



► 33.5 The case of an unknown Gaussian: approximating the posterior 
distribution of fi and a 

We will fit an approximating ensemble Q(fi,a) to the posterior distribution 
that we studied in Chapter 24, 



(33.36) 



We make the single assumption that the approximating ensemble is separable 
in the form Q(^,,a) = Q ^{^)Q a {a) . No restrictions on the functional form of 
Qfj.{n) and <5o-(c) are made. 

We write down a variational free energy, 

HQ) = / d,d.g,M«,Mh f( g;M'jg t) . (33.37, 

We can find the optimal separable distribution Q by considering separately 
the optimization of F over Qfj,(fJ.) for fixed QaW), and then the optimization 
of Q a {a) for fixed Q^). 
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Figure 33.4. Optimization of an 
approximating distribution. The 
posterior distribution 
P{n, cr | {x n }), which is the same 
as that in figure 24.1, is shown by 
solid contours, (a) Initial 
condition. The approximating 
distribution Q(p,a) (dotted 
contours) is an arbitrary separable 
distribution, (b) Q M has been 
updated, using equation (33.41). 
(c) Q a has been updated, using 
equation (33.44). (d) updated 
again, (c) Q a updated again, (f) 
Converged approximation (after 
15 iterations). The arrows point 
to the peaks of the two 
distributions, which are at 
a N = 0.45 (for P) and o-jv-i = 0.5 
(for Q). 



Optimization ofQ^/j.) 

As a functional of Qp(/i), F is: 



o> <3u(/w) 



da Q a (a) In P(D \n,a)+ HP(n)/Q^)} 



J d M Q» J da Q a (a)NP±(n - x f + In Q^(a) 



k (33.38) 
(33.39) 



where (3 = 1/a 2 and k denote constants that do not depend on The 
dependence on Q a thus collapses down to a simple dependence on the mean 



I 



da Q a (a)l/a 2 . 



(33.40) 



Now we can recognize the function —Nf3^(/i — x) 2 as the logarithm of a 
Gaussian identical to the posterior distribution for a particular value of (3 = (3. 
Since a relative entropy J Q\a(Q/P) is minimized by setting Q = P, we can 
immediately write down the distribution Q^ pt (^t) that minimizes F for fixed 

Qa- 

QT^) = P(j*\D,P,n) = Normal^; x,a^). (33.41) 
where a^ D = l/(N/3). 

Optimization of Q a (a) 

We represent Q a {a) using the density over /?, Q a {P) = Q<r{a) \da/d(3\. As a The prior P(a) oc 1/a transforms 

to P{(3) oc \/j3. 



functional of Q a {(3), F is (neglecting additive constants): 

F = -J dp Q„(0) J d M Q» In P(D | M , a) + ln[P((3)/Q a (f3)} 



(33.42) 



JdPQ a (P) [{Na 2 ^ D + S)(3/2- (f - l) ln/3 + lnQ ff (/3)] , (33.43) 
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33.6: Interlude 

where the integral over u is performed assuming Q ^{u) = Q^ pt (^<). Here, the (3- 
dependent expression in square brackets can be recognized as the logarithm of 
a gamma distribution over (3 - see equation (23.15) - giving as the distribution 
that minimizes F for fixed Q^: 

Q°P t (/3)=r(/3;6',c / ), (33.44) 

with 

li = l(Nal lD + S) and c' = y. (33.45) 

In figure 33.4, these two update rules (33.41, 33.44) are applied alternately, 
starting from an arbitrary initial condition. The algorithm converges to the 
optimal approximating ensemble in a few iterations. 

Direct solution for the joint optimum Q fl (p)Qa(o~) 

In this problem, we do not need to resort to iterative computation to find 
the optimal approximating ensemble. Equations (33.41) and (33.44) define 
the optimum implicitly. We must simultaneously have cr 2 ^ D = 1/{N(3), and 
{3 = b'c'. The solution is: 

l/p= S/(N- 1). (33.46) 

This is similar to the true posterior distribution of cr, which is a gamma distri- 
bution with c' = ^y^- and 1/6' = 5/2 (see equation 24.13). This true posterior 
also has a mean value of (3 satisfying 1/(3 = S/(N — 1); the only difference is 
that the approximating distribution's parameter c' is too large by 1/2. 



The approximations given by variational free energy minimization 
always tend to be more compact than the true distribution. 



In conclusion, ensemble learning gives an approximation to the posterior 
that agrees nicely with the conventional estimators. The approximate poste- 
rior distribution over (3 is a gamma distribution with mean (3 corresponding 
to a variance of cr 2 = S/(N — 1) = cr 2 ^. And the approximate posterior dis- 
tribution over ix is a Gaussian with mean x and standard deviation a N _ t /y/N. 

The variational free energy minimization approach has the nice prop- 
erty that it is parameterization-independent; it avoids the problem of basis- 
dependence from which MAP methods and Laplace's method suffer. 

A convenient software package for automatic implementation of variational 
inference in graphical models is VIBES (Bishop et al, 2002). It plays the same 
role for variational inference as BUGS plays for Monte Carlo inference. 

► 33.6 Interlude 

One of my students asked: 

How do you ever come up with a useful approximating distribution, 
given that the true distribution is so complex you can't compute 
it directly? 

Let's answer this question in the context of Bayesian data modelling. Let the 
'true' distribution of interest be the posterior probability distribution over a 
set of parameters x, P(x | D). A standard data modelling practice is to find 
a single, 'best-fit' setting of the parameters, x*, for example, by finding the 
maximum of the likelihood function P{D | x), or of the posterior distribution. 
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One interpretation of this standard practice is that the full description of 
our knowledge about x, P(x | D), is being approximated by a delta- function, 
a probability distribution concentrated on x*. From this perspective, any 
approximating distribution Q(x;0), no matter how crummy it is, has to be 
an improvement on the spike produced by the standard method! So even if 
we use only a simple Gaussian approximation, we are doing well. 

We now study an application of the variational approach to a realistic 
example - data clustering. 



► 33.7 K-means clustering and the expectation-maximization algo- 
rithm as a variational method 

In Chapter 20, we introduced the soft K-means clustering algorithm, version 1. 
In Chapter 22, we introduced versions 2 and 3 of this algorithm, and motivated 
the algorithm as a maximum likelihood algorithm. 

K-means clustering is an example of an 'expectation-maximization' (EM) 
algorithm, with the two steps, which we called 'assignment' and 'update', 
being known as the 'E-step' and the 'M-step' respectively. 

We now give a more general view of K-means clustering, due to Neal 
and Hinton (1998), in which the algorithm is shown to optimize a variational 
objective function. Neal and Hinton's derivation applies to any EM algorithm. 



The probability of everything 

Let the parameters of the mixture model - the means, standard deviations, and 
weights - be denoted by 0. For each data point, there is a missing variable (also 
known as a latent variable), the class label k n for that point. The probability 
of everything, given our assumed model H, is 

N 

P({x("),fcX=i,0|W) = P(0\n)l[ [p(^\k n ,d)P(k n \G)] . (33.47) 

n=l 

The posterior probability of everything, given the data, is proportional to the 
probability of everything: 

p({M£i,»I{* (b) }£U,w) = Pi{ p^}fl'\H) H) ■ (33 - 48) 

We now approximate this posterior distribution by a separable distribution 

Qk({k n }Li)Qe(e), (33.49) 
and define a variational free energy in the usual way: 

F{Q k ,Qe)= [ dDe Q*({U£U) QeW In Jf^ } ^ Qb ™ ■ 

(33.50) 

F is bounded below by minus the evidence, lnP({xW}^ =1 1 H). We can now 
make an iterative algorithm with an 'assignment' step and an 'update' step. 
In the assignment step, Qk{{k n }n=i) ls adjusted to reduce F, for fixed Qg; in 
the update step, Qg is adjusted to reduce F, for fixed Qfe- 

If we wish to obtain exactly the soft K-means algorithm, we impose a 
further constraint on our approximating distribution: Qg is constrained to be 
a delta function centred on a point estimate of 0, 8 = 6*: 



Qg(d) = 5(0-0*). 



(33.51) 
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2 n 




Upper bound 



Lower bound 



where 



1 + e- 



< exp(/*a - #|0)) 



n e [0, l] 



l + e" 
[>(!/) -1/2J/2I/ 



> g(^)cxp[(a- i /)/2-A(^)(a 2 -i/ 2 )] 



Unfortunately, this distribution contributes to the variational free energy an 
infinitely large integral Jd 8 Qg(8)lnQg(8), so we'd better leave that term 
out of F, treating it as an additive constant. [Using a delta function Qg is not 
a good idea if our aim is to minimize F\] Moving on, our aim is to derive the 
soft K-means algorithm. 

> bxercise 33.2. [2] Show that, given Qe{6) = 5(8 — 8*), the optimal Qk, in the 

sense of minimizing F, is a separable distribution in which the probabil- 
ity that k n = k is given by the responsibility 

> Exercise 33.3. Show that, given a separable Qk as described above, the op- 

timal 8*, in the sense of minimizing F, is obtained by the update step 
of the soft K-means algorithm. (Assume a uniform prior on 8.) 

Exercise 33.4. ^ We can instantly improve on the infinitely large value of F 
achieved by soft K-means clustering by allowing Qg to be a more general 
distribution than a delta-function. Derive an update step in which Qg is 
allowed to be a separable distribution, a product of Q^d/i}), Q &({&}), 
and Qtt(tt). Discuss whether this generalized algorithm still suffers from 
soft K-means's 'kaboom' problem, where the algorithm glues an ever- 
shrinking Gaussian to one data point. 

Sadly, while it sounds like a promising generalization of the algorithm 
to allow Qg to be a non-delta-function, and the 'kaboom' problem goes 
away, other artefacts can arise in this approximate inference method, 
involving local minima of F. For further reading, see (MacKay, 1997a; 
MacKay, 2001). 



Figure 33.5. Illustration of the 
Jaakkola- Jordan variational 
method. Upper and lower bounds 
on the logistic function (solid line) 



9{o) = 



1 



1 + e _ ° 



These upper and lower bounds are 
exponential or Gaussian functions 
of a, and so easier to integrate 
over. The graph shows the 
sigmoid function and upper and 
lower bounds with /i = 0.505 and 
v = -2.015. 



► 33.8 Variational methods other than free energy minimization 

There are other strategies for approximating a complicated distribution P(x), 
in addition to those based on minimizing the relative entropy between an 
approximating distribution, Q, and P. One approach pioneered by Jaakkola 
and Jordan is to create adjustable upper and lower bounds Q u and Q L to P, 
as illustrated in figure 33.5. These bounds (which are unnormalized densities) 
are parameterized by variational parameters which are adjusted in order to 
obtain the tightest possible fit. The lower bound can be adjusted to maximize 

E^ L ( x )' ( 33 - 52 ) 

X 

and the upper bound can be adjusted to minimize 



X 



(33.53) 
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Using the normalized versions of the optimized bounds we then compute ap- 
proximations to the predictive distributions. Further reading on such methods 
can be found in the references (Jaakkola and Jordan, 2000a; Jaakkola and Jor- 
dan, 2000b; Jaakkola and Jordan, 1996; Gibbs and MacKay, 2000). 



Further reading 

The Bethe and Kikuchi free energies 

In Chapter 26 we discussed the sum-product algorithm for functions of the 
factor-graph form (26.1). If the factor graph is tree- like, the sum-product algo- 
rithm converges and correctly computes the marginal function of any variable 
x n and can also yield the joint marginal function of subsets of variables that 
appear in a common factor, such as x m . 

The sum-product algorithm may also be applied to factor graphs that are 
not tree-like. If the algorithm converges to a fixed point, it has been shown 
that that fixed point is a stationary point (usually a minimum) of a function 
of the messages called the Kikuchi free energy. In the special case where all 
factors in factor graph are functions of one or two variables, the Kikuchi free 
energy is called the Bethe free energy. 

For articles on this idea, and new approximate inference algorithms mo- 
tivated by it, see Yedidia (2000); Yedidia et al. (2000); Welling and Teh 
(2001); Yuille (2001); Yedidia et al. (2001b); Yedidia et al. (2001a). 



► 33.9 Further exercises 

4U Exercise 33.5. [2 ' p ' 435] This exercise explores the assertion, made above, that 
the approximations given by variational free energy minimization al- 
ways tend to be more compact than the true distribution. Consider a 
two dimensional Gaussian distribution P(x) with axes aligned with the 
directions e' 1 ' = (1, 1) and e' 2 -* = (1,-1). Let the variances in these two 
directions be af and a\. What is the optimal variance if this distribution 
is approximated by a spherical Gaussian with variance <7q, optimized by 
variational free energy minimization? If we instead optimized the objec- 
tive function 

what would be the optimal value of <t 2 ? Sketch a contour of the true 
distribution P(x) and the two approximating distributions in the case 
ai/a 2 = 10. 

[Note that in general it is not possible to evaluate the objective func- 
tion G, because integrals under the true distribution -P(x) are usually 
intractable.] 

t^ks bxercise 

33. 6 . [2, P.436] What dQ 

you think of the idea of using a variational 
method to optimize an approximating distribution Q which we then use 
as a proposal density for importance sampling? 

• ' Exercise 33.7. ^ Define the relative entropy or Kullback-Leibler divergence be- 
tween two probability distributions P and Q, and state Gibbs' inequality. 

Consider the problem of approximating a joint distribution P(x, y) by a 
separable distribution Q(x,y) = Q x{ x )Qy{v)- Show that if the objec- 
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tive function for this approximation is 

G(Q x ,Q Y ) = Y t P(x,y)log 2 



P{x,y) 



Qx{x)Q Y {y) 



that the minimal value of G is achieved when Qx and Qy are equal to 
the marginal distributions over x and y. 

Now consider the alternative objective function 

F(Qx, Qy) = £ Qx{x)Qy{v) lo g2 Qx ^ Qy ^ ; 

x,y \ x iy) 

the probability distribution P(x,y) shown in the margin is to be ap- 
proximated by a separable distribution Q(x,y) = Q x( x )QY{y)- State 
the value of F(Qx,Qy) if Qx and Qy are set to the marginal distribu- 
tions over x and y. 

Show that F(Qx, Qy) has three distinct minima, identify those minima, 
and evaluate F at each of them. 



P(x, y) 
1 

y 2 

3 

4 



1 



x 

2 3 



l k l k 

V8 V8 

l k 

l k 



► 33.10 Solutions 

Solution to exercise 33.5 (p. 434). We need to know the relative entropy be- 
tween two one-dimensional Gaussian distributions: 



/ 



da; Normal(a;; 0, <tq) In 



Normal(a;; 0, <jq) 
Normal(a;; 0, ap) 



I 



= / da; Normal(a;; 0, ctq) 



In 



op 




(33.55) 
(33.56) 



So, if we approximate P, whose variances are u\ and cr|, by Q, whose variances 
are both <7q, we find 




differentiating, 



dln(a|)- 



which is zero when 



Q 



1 


-2 + 


9- 


4)1 


2 








1 








" 2 


(4- 


^2/ 





(33.57) 



(33.58) 



(33.59) 



Thus we set the approximating distribution's inverse variance to the mean 
inverse variance of the target distribution P. 

In the case o\ = 10 and u-i = 1, we obtain oq ~ \/2, which is just a factor 
of y/2 larger than pretty much independent of the value of the larger 
standard deviation o\ . Variational free energy minimization typically leads to 
approximating distributions whose length scales match the shortest length scale 
of the target distribution. The approximating distribution might be viewed as 
too compact. 
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(a) 




(b) 




Figure 33.6. Two separable 
Gaussian approximations (dotted 
lines) to a bivariatc Gaussian 
distribution (solid line), (a) The 
approximation that minimizes the 
variational free energy, (b) The 
approximation that minimizes the 
objective function G. In each 
figure, the lines show the contours 
at which x T Ax = 1, where A is 
the inverse covariance matrix of 
the Gaussian. 



In contrast, if we use the objective function G then we find: 



G(4) 




constant, 



where the constant depends on o\ and o~2 only. Differentiating, 



dlncr2 G 2 



Q 









2 - 






u 





which is zero when 



^2 1 / 2 , 2\ 
< J Q= ^ \°\ + °2j ■ 



(33.60) 



(33.61) 



(33.62) 



Thus we set the approximating distribution's variance to the mean variance 
of the target distribution P. 

In the case o\ = 10 and 02 = 1, we obtain <tq ~ 10/\/2, which is just a 
factor of y/2 smaller than 01, independent of the value of 02 . 

The two approximations are shown to scale in figure 33.6. 



Solution to exercise 33.6 (p. 434). The best possible variational approximation 
is of course the target distribution P. Assuming that this is not possible, a 
good variational approximation is more compact than the true distribution. 
In contrast, a good sampler is more heavy tailed than the true distribution. 
An over-compact distribution would be a lousy sampler with a large variance. 
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34 



Independent Component Analysis and 
Latent Variable Modelling 



► 34.1 Latent variable models 



Many statistical models are generative models (that is, models that specify 
a full probability density over all variables in the situation) that make use of 
latent variables to describe a probability distribution over observables. 

Examples of latent variable models include Chapter 22's mixture models, 
which model the observables as coming from a superposed mixture of simple 
probability distributions (the latent variables are the unknown class labels 
of the examples); hidden Markov models (Rabiner and Juang, 1986; Durbin 
et ai, 1998); and factor analysis. 

The decoding problem for error-correcting codes can also be viewed in 
terms of a latent variable model - figure 34.1. In that case, the encoding 
matrix G is normally known in advance. In latent variable modelling, the 
parameters equivalent to G are usually not known, and must be inferred from 
the data along with the latent variables s. 

Usually, the latent variables have a simple distribution, often a separable 
distribution. Thus when we fit a latent variable model, we are finding a de- 
scription of the data in terms of 'independent components'. The 'independent 
component analysis' algorithm corresponds to perhaps the simplest possible 
latent variable model with continuous latent variables. 



si 

oooooooo 





m w h w m w m 

oooooooooooooooo 

2/1 

Figure 34.1. Error-correcting 
codes as latent variable models. 
The K latent variables are the 
independent source bits 
si, . . . , sk; these give rise to the 
observables via the generator 
matrix G. 



► 34.2 The generative model for independent component analysis 

A set of N observations D = {x^}^ =1 are assumed to be generated as follows. 
Each J-dimensional vector x is a linear mixture of I underlying source signals, 
s: 

x=Gs, (34.1) 

where the matrix of mixing coefficients G is not known. 

The simplest algorithm results if we assume that the number of sources 
is equal to the number of observations, i.e., I = J. Our aim is to recover 
the source variables s (within some multiplicative factors, and possibly per- 
muted). To put it another way, we aim to create the inverse of G (within a 
post-multiplicative factor) given only a set of examples {x}. We assume that 
the latent variables are independently distributed, with marginal distributions 
P(si \H) = Pi(si). Here H denotes the assumed form of this model and the 
assumed probability distributions pi of the latent variables. 

The probability of the observables and the hidden variables, given G and 
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7i, is: 

N 

P({xW, S W}^ =1 |G,H) = Yl [P{* (n) |sW,G,W)P(sW |W)] (34.2) 

n=l 



JV 

n 

n=l 



(34.3) 



We assume that the vector x is generated without noise. This assumption is 
not usually made in latent variable modelling, since noise-free data are rare; 
but it makes the inference problem far simpler to solve. 

The likelihood function 

For learning about G from the data D, the relevant quantity is the likelihood 
function 

P(D\G,H) = ]JP{^ n) \G,H) (34.4) 

n 

which is a product of factors each of which is obtained by marginalizing over 
the latent variables. When we marginalize over delta functions, remember 
that J ds S(x — vs)f(s) = ^f(x/v). We adopt summation convention at this 

point, such that, for example, Gjisf 1 ^ = J2i Gjisf\ A single factor in the 
likelihood is given by 

P(x< n >|G,W) = y dV") P( X W |s< n \G,W)P(s< n > \H) (34.5) 

= / dV") n 5 - Gasf) UpM n) ) (34-6) 

= ]dSG|II ft( ^ (347) 
-ln|detG| + ^lnp^G- 1 ^). (34.8) 



lnP(xW \G,H) 



To obtain a maximum likelihood algorithm we find the gradient of the log 
likelihood. If we introduce W = G _1 , the log likelihood contributed by a 
single example may be written: 



lnP(x (n) \G,H)= In |det W| + ^\a. Pi {W ij x j ). 



(34.9) 



We'll assume from now on that detW is positive, so that we can omit the 
absolute value sign. We will need the following identities: 



Let us define 







dG 



IndetG 



dGji 



G7l 



d 





(34.10) 


= ~G h l G^ = -W l3 W m 


(34.11) 


Gjm {dG lm f ) Gli - 


(34.12) 



WijXj, 



(ai) = din pi(ai)/dai, 



(34.13) 
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Repeat for each datapoint x: 

1. Put x through a linear mapping: 

a = Wx. 

2. Put a through a nonlinear map: 

Zi = <t>i{a,i), 

where a popular choice for <fi is <fi = — tanh(ai). 

3. Adjust the weights in accordance with 

AWoc [W T ] _1 + zx T . 



Algorithm 34.2. Independent 
component analysis - online 
steepest ascents version. 
See also algorithm 34.4, which is 
to be preferred. 



and Zi = 4>i((ii), which indicates in which direction a, needs to change to make 
the probability of the data greater. We may then obtain the gradient with 
respect to Gji using equations (34.10) and (34.11): 



d 



lnP(x (n) \G,H) = -Wij - aiZ V W v . 



Or alternatively, the derivative with respect to W%j m . 
3 



dWi 



■lnP(x( n ) \G,H) = Gji + XjZi. 



(34.14) 



(34.15) 



If we choose to change W so as to ascend this gradient, we obtain the learning 
rule 

AW oc [W T ] _1 + zx T . (34.16) 
The algorithm so far is summarized in algorithm 34.2. 

Choices of 4> 

The choice of the function (p defines the assumed prior distribution of the 
latent variable s. 

Let's first consider the linear choice (f>i{ a i) = — Ka i, which implicitly (via 
equation 34.13) assumes a Gaussian distribution on the latent variables. The 
Gaussian distribution on the latent variables is invariant under rotation of the 
latent variables, so there can be no evidence favouring any particular alignment 
of the latent variable space. The linear algorithm is thus uninteresting in that 
it will never recover the matrix G or the original sources. Our only hope is 
thus that the sources are non-Gaussian. Thankfully, most real sources have 
non-Gaussian distributions; often they have heavier tails than Gaussians. 

We thus move on to the popular tanh nonlinearity. If 



k{ai) = ~ tanh(ai) 



(34.17) 



then implicitly we are assuming 



Pi(si) oc l/cosh(sj) oc 



£ Si _|_ g Si 



(34.18) 



This is a heavier-tailed distribution for the latent variables than the Gaussian 
distribution. 
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(a) 




(c) 




(b) 



30r- 

20 

10- 
x2 0- 
-10- 
-20- • 
-3ol 



, -30 -20 -10 10 20 30 

(d) 



We could also use a tanh nonlinearity with gain (3, that is, (f>i(ai) = 
— tanh(/3oi), whose implicit probabilistic model is Pi(si) oc l/[cosh(/3si)] 1 ' / ' 3 . In 
the limit of large f3, the nonlinearity becomes a step function and the probabil- 
ity distribution pi (s^ becomes a biexponential distribution, Pi(si) oc exp( — |s|). 
In the limit /? — > 0, Pi(si) approaches a Gaussian with mean zero and variance 
1//3. Heavier-tailed distributions than these may also be used. The Student 
and Cauchy distributions spring to mind. 



variables, for G 



G 



(broader 



Figure 34.3. Illustration of the 
generative models implicit in the 
learning algorithm. 

(a) Distributions over two 
observables generated by 1 / cosh 
distributions on the_ latent 

3/4 1/2 
1/2 1 
(compact distribution) and 
2 -1 
-1 3/2 
distribution), (b) Contours of the 
generative distributions when the 
latent variables have Cauchy 
distributions. The learning 
algorithm fits this amoeboid 
object to the empirical data in 
such a way as to maximize the 
likelihood. The contour plot in 

(b) does not adequately represent 
this heavy-tailed distribution. 

(c) Part of the tails of the Cauchy 
distribution, giving the contours 
0.01 ... 0.1 times the density at 
the origin, (d) Some data from 
one of the generative distributions 
illustrated in (b) and (c). Can you 
tell which? 200 samples were 
created, of which 196 fell in the 
plotted region. 



Example distributions 

Figures 34.3(a-c) illustrate typical distributions generated by the independent 
components model when the components have 1/cosh and Cauchy distribu- 
tions. Figure 34.3d shows some samples from the Cauchy model. The Cauchy 
distribution, being the more heavy-tailed, gives the clearest picture of how the 
predictive distribution depends on the assumed generative parameters G. 



► 34.3 A covariant, simpler, and faster learning algorithm 

We have thus derived a learning algorithm that performs steepest descents 
on the likelihood function. The algorithm does not work very quickly, even 
on toy data; the algorithm is ill-conditioned and illustrates nicely the general 
advice that, while finding the gradient of an objective function is a splendid 
idea, ascending the gradient directly may not be. The fact that the algorithm is 
ill-conditioned can be seen in the fact that it involves a matrix inverse, which 
can be arbitrarily large or even undefined. 



Covariant optimization in general 

The principle of covariance says that a consistent algorithm should give the 
same results independent of the units in which quantities are measured (Knuth, 
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1968). A prime example of a non-covariant algorithm is the popular steepest 
descents rule. A dimensionless objective function L(w) is defined, its deriva- 
tive with respect to some parameters w is computed, and then w is changed 
by the rule 



This popular equation is dimensionally inconsistent: the left-hand side of this 
equation has dimensions of [wj] and the right-hand side has dimensions 
The behaviour of the learning algorithm (34.19) is not covariant with respect 
to linear rescaling of the vector w. Dimensional inconsistency is not the end of 
the world, as the success of numerous gradient descent algorithms has demon- 
strated, and indeed if n decreases with n (during on-line learning) as 1/n then 
the Munro-Robbins theorem (Bishop, 1992, p. 41) shows that the parameters 
will asymptotically converge to the maximum likelihood parameters. But the 
non-covariant algorithm may take a very large number of iterations to achieve 
this convergence; indeed many former users of steepest descents algorithms 
prefer to use algorithms such as conjugate gradients that adaptively figure 
out the curvature of the objective function. The defense of equation (34.19) 
that points out n could be a dimensional constant is untenable if not all the 
parameters W{ have the same dimensions. 

The algorithm would be covariant if it had the form 



where M is a positive-definite matrix whose i, i' element has dimensions [lUjiUj/]. 
From where can we obtain such a matrix? Two sources of such matrices are 
metrics and curvatures. 

Metrics and curvatures 

If there is a natural metric that defines distances in our parameter space w, 
then a matrix M can be obtained from the metric. There is often a natural 
choice. In the special case where there is a known quadratic metric defining 
the length of a vector w, then the matrix can be obtained from the quadratic 
form. For example if the length is w 2 then the natural matrix is M = I, and 
steepest descents is appropriate. 

Another way of finding a metric is to look at the curvature of the objective 
function, defining A = — VVL (where V = d/dw). Then the matrix M = 
A -1 will give a covariant algorithm; what is more, this algorithm is the Newton 
algorithm, so we recognize that it will alleviate one of the principal difficulties 
with steepest descents, namely its slow convergence to a minimum when the 
objective function is at all ill-conditioned. The Newton algorithm converges 
to the minimum in a single step if L is quadratic. 

In some problems it may be that the curvature A consists of both data- 
dependent terms and data-independent terms; in this case, one might choose 
to define the metric using the data-independent terms only (Gull, 1989). The 
resulting algorithm will still be covariant but it will not implement an exact 
Newton step. Obviously there are many covariant algorithms; there is no 
unique choice. But covariant algorithms are a small subset of the set of all 
algorithms! 




(34.19) 




(34.20) 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 



442 34 — Independent Component Analysis and Latent Variable Modelling 

Back to independent component analysis 

For the present maximum likelihood problem we have evaluated the gradient 
with respect to G and the gradient with respect to W = G _1 . Steepest 
ascents in W is not covariant. Let us construct an alternative, covariant 
algorithm with the help of the curvature of the log likelihood. Taking the 
second derivative of the log likelihood with respect to W we obtain two terms, 
the first of which is data-independent: 

and the second of which is data-dependent: 

d(zjXj) 



XjXi5ikz[, (no sum over i) (34.22) 



where z' is the derivative of z. It is tempting to drop the data-dependent term 
and define the matrix M by [M^ 1 ]^^^ = [Gj k Gu]. However, this matrix 
is not positive definite (it has at least one non-positive eigenvalue), so it is 
a poor approximation to the curvature of the log likelihood, which must be 
positive definite in the neighbourhood of a maximum likelihood solution. We 
must therefore consult the data-dependent term for inspiration. The aim is 
to find a convenient approximation to the curvature and to obtain a covariant 
algorithm, not necessarily to implement an exact Newton step. What is the 
average value of XjXibi k z'(l If the true value of G is G*, then 

(xjXiSikz'i) = (G* m s m s„Gf„<5 ifc 2') . (34.23) 

We now make several severe approximations: we replace G* by the present 
value of G, and replace the correlated average {s m s n z'^ by (s m s n ) (z'j) = 
E mn Di. Here S is the variance-covariance matrix of the latent variables 
(which is assumed to exist), and Di is the typical value of the curvature 
d 2 In pi (a) /da 2 . Given that the sources are assumed to be independent, X 
and D are both diagonal matrices. These approximations motivate the ma- 
trix M given by: 

[M~ }(ij)(ki) = Gj m £ mn Gi n 5 ik D i7 (34.24) 

that is, 

M mkl) = W mj Z^ n W nl 5 ik D~\ (34.25) 

For simplicity, we further assume that the sources are similar to each other so 
that £ and D are both homogeneous, and that SD = 1. This will lead us to 
an algorithm that is covariant with respect to linear rescaling of the data x, 
but not with respect to linear rescaling of the latent variables. We thus use: 

M (ij)m = W m3 W m i5 lk . (34.26) 

Multiplying this matrix by the gradient in equation (34.15) we obtain the 
following covariant learning algorithm: 

AWij = r, (Wij + W Vj a v z t ) . (34.27) 

Notice that this expression does not require any inversion of the matrix W. 
The only additional computation once z has been computed is a single back- 
ward pass through the weights to compute the quantity 



x'j = W Vi a v 



(34.28) 
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34.3: A covariant, simpler, and faster learning algorithm 

Repeat for each datapoint x: 

1. Put x through a linear mapping: 

a = Wx. 

2. Put a through a nonlinear map: 

Zi = 4>i{a,i), 

where a popular choice for <j) is <t> = — tanh(a,). 

3. Put a back through W: 

x' = W T a. 

4. Adjust the weights in accordance with 

AW oc W + zx /T . 
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Algorithm 34.4. Independent 
component analysis - covariant 
version. 



in terms of which the covariant algorithm reads: 

AWij = n (Wij + x'^) . (34.29) 

The quantity {Wij + x'jZij on the right-hand side is sometimes called the 
natural gradient. The covariant independent component analysis algorithm is 
summarized in algorithm 34.4. 

Further reading 

ICA was originally derived using an information maximization approach (Bell 
and Sejnowski, 1995). Another view of ICA, in terms of energy functions, 
which motivates more general models, is given by Hinton et al. (2001). Another 
generalization of ICA can be found in Pearlmutter and Parra (1996, 1997). 
There is now an enormous literature on applications of ICA. A variational free 
energy minimization approach to ICA-like models is given in (Miskin, 2001; 
Miskin and MacKay, 2000; Miskin and MacKay, 2001). Further reading on 
blind separation, including non-ICA algorithms, can be found in (Jutten and 
Herault, 1991; Comon et al, 1991; Hendin et al, 1994; Amari et al, 1996; 
Hojen-Sorensen et al, 2002). 

Infinite models 

While latent variable models with a finite number of latent variables are widely 
used, it is often the case that our beliefs about the situation would be most 
accurately captured by a very large number of latent variables. 

Consider clustering, for example. If we attack speech recognition by mod- 
elling words using a cluster model, how many clusters should we use? The 
number of possible words is unbounded (section 18.2), so we would really like 
to use a model in which it's always possible for new clusters to arise. 

Furthermore, if we do a careful job of modelling the cluster corresponding 
to just one English word, we will probably find that the cluster for one word 
should itself be modelled as composed of clusters - indeed, a hierarchy of 
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clusters within clusters. The first levels of the hierarchy would divide male 
speakers from female, and would separate speakers from different regions - 
India, Britain, Europe, and so forth. Within each of those clusters would be 
subclusters for the different accents within each region. The subclusters could 
have subsubclusters right down to the level of villages, streets, or families. 

Thus we would often like to have infinite numbers of clusters; in some 
cases the clusters would have a hierarchical structure, and in other cases the 
hierarchy would be flat. So, how should such infinite models be implemented 
in finite computers? And how should we set up our Bayesian models so as to 
avoid getting silly answers? 

Infinite mixture models for categorical data are presented in Neal (1991), 
along with a Monte Carlo method for simulating inferences and predictions. 
Infinite Gaussian mixture models with a flat hierarchical structure are pre- 
sented in Rasmussen (2000). Neal (2001) shows how to use Dirichlet diffusion 
trees to define models of hierarchical clusters. Most of these ideas build on 
the Dirichlet process (section 18.2). This remains an active research area 
(Rasmussen and Ghahramani, 2002; Beal et al, 2002). 



► 34.4 Exercises 

Exercise 34. 1. Repeat the derivation of the algorithm, but assume a small 
amount of noise in x: x = Gs + n; so the term 5 {x^ — J2i ^j» s i"^) 
in the joint probability (34.3) is replaced by a probability distribution 
over xj 1 with mean VV G,jS.- . Show that, if this noise distribution has 
sufficiently small standard deviation, the identical algorithm results. 

Exercise 34.2. ^ Implement the covariant ICA algorithm and apply it to toy 
data. 

Exercise 34.3. Create algorithms appropriate for the situations: (a) x in- 
cludes substantial Gaussian noise; (b) more measurements than latent 
variables (J > I); (c) fewer measurements than latent variables (J < I). 

Factor analysis assumes that the observations x can be described in terms of 
independent latent variables {s^} and independent additive noise. Thus the 
observable x is given by 

x = Gs + n, (34.30) 

where n is a noise vector whose components have a separable probability distri- 
bution. In factor analysis it is often assumed that the probability distributions 
of {sk} and {rii} are zero-mean Gaussians; the noise terms may have different 
variances of. 

Exercise 34.4.^ Make a maximum likelihood algorithm for inferring G from 
data, assuming the generative model x = Gs + n is correct and that s 
and n have independent Gaussian distributions. Include parameters <r| 
to describe the variance of each rij, and maximize the likelihood with 
respect to them too. Let the variance of each s, be 1. 



Exercise 34.5. " ' Implement the infinite Gaussian mixture model of Rasmussen 
(2000). 
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35 



Random Inference Topics 



► 35.1 What do you know if you are ignorant? 

Example 35.1. A real variable x is measured in an accurate experiment. For 
example, x might be the half-life of the neutron, the wavelength of light 
emitted by a firefly, the depth of Lake Vostok, or the mass of Jupiter's 
moon Io. 

What is the probability that the value of x starts with a '1', like the 
charge of the electron (in S.I. units), 



e = 1.602. 
and the Boltzmann constant, 



x 1(T 19 C, 



k= 1.380 66... x KT^JK -1 ? 

And what is the probability that it starts with a '9', like the Faraday 
constant, 

T = 9.648 . . . x 10 4 Cmor 1 ? 

What about the second digit? What is the probability that the mantissa 
of x starts '1.1. . . ', and what is the probability that x starts '9.9. . . ' ? 

Solution. An expert on neutrons, fireflies, Antarctica, or Jove might be able to 
predict the value of x, and thus predict the first digit with some confidence, but 
what about someone with no knowledge of the topic? What is the probability 
distribution corresponding to 'knowing nothing' ? 

One way to attack this question is to notice that the units of x have not 
been specified. If the half-life of the neutron were measured in fortnights 
instead of seconds, the number x would be divided by 1 209 600; if it were 
measured in years, it would be divided by 3 x 10 7 . Now, is our knowledge 
about x, and, in particular, our knowledge of its first digit, affected by the 
change in units? For the expert, the answer is yes; but let us take someone 
truly ignorant, for whom the answer is no; their predictions about the first digit 
of x are independent of the units. The arbitrariness of the units corresponds to 
invariance of the probability distribution when x is multiplied by any number. 

If you don't know the units that a quantity is measured in, the probability 
of the first digit must be proportional to the length of the corresponding piece 
of logarithmic scale. The probability that the first digit of a number is 1 is 
thus 

log 2 — log 1 log 2 
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Figure 35.1. When viewed on a 
logarithmic scale, scales using 
different units are translated 
relative to each other. 



Pi 



log 10 — log 1 log 10' 



(35.1) 
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Now, 2 10 = 1024 ~ 10 3 = 1000, so without needing a calculator, we have 
10 log 2 ~ 3 log 10 and 

Pi - ^- (35.2) 
More generally, the probability that the first digit is d is 

(Iog(d + 1) - Iog(d))/(Iog 10 - log 1) = log 10 (l + l/d). 



This observation about initial digits is known as Benford's law. 
does not correspond to a uniform probability distribution. 



(35.3) 

Ignorance 
□ 



t> bxercise 35.2. [2] A pin is thrown tumbling in the air. What is the probability 
distribution of the angle B\ between the pin and the vertical at a moment 
while it is in the air? The tumbling pin is photographed. What is the 
probability distribution of the angle #3 between the pin and the vertical 
as imaged in the photograph? 

> Exercise 35.3. Record breaking. Consider keeping track of the world record 
for some quantity x, say earthquake magnitude, or longjump distances 
jumped at world championships. If we assume that attempts to break 
the record take place at a steady rate, and if we assume that the under- 
lying probability distribution of the outcome x, P(x), is not changing - 
an assumption that I think is unlikely to be true in the case of sports 
endeavours, but an interesting assumption to consider nonetheless - and 
assuming no knowledge at all about P(x), what can be predicted about 
successive intervals between the dates when records are broken? 



P(9)$ 



P(3) 



10 
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6 
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- 3 



P(l) 



► 35.2 The Luria-Delbriick distribution 

Exercise 35.4.^' P- 449 l l n their landmark paper demonstrating that bacteria 
could mutate from virus sensitivity to virus resistance, Luria and Delbriick 
(1943) wanted to estimate the mutation rate in an exponentially-growing pop- 
ulation from the total number of mutants found at the end of the experi- 
ment. This problem is difficult because the quantity measured (the number 
of mutated bacteria) has a heavy-tailed probability distribution: a mutation 
occuring early in the experiment can give rise to a huge number of mutants. 
Unfortunately, Luria and Delbriick didn't know Bayes' theorem, and their way 
of coping with the heavy-tailed distribution involves arbitrary hacks leading to 
two different estimators of the mutation rate. One of these estimators (based 
on the mean number of mutated bacteria, averaging over several experiments) 
has appallingly large variance, yet sampling theorists continue to use it and 
base confidence intervals around it (Kepler and Oprea, 2001). In this exercise 
you'll do the inference right. 

In each culture, a single bacterium that is not resistant gives rise, after g 
generations, to N = 2 9 descendants, all clones except for differences arising 
from mutations. The final culture is then exposed to a virus, and the number 
of resistant bacteria n is measured. According to the now accepted mutation 
hypothesis, these resistant bacteria got their resistance from random mutations 
that took place during the growth of the colony. The mutation rate (per cell 
per generation), a, is about one in a hundred million. The total number of 
opportunities to mutate is N, since YliZo 2 1 ~ 2 9 = N. If a bacterium mutates 
at the ith generation, its descendants all inherit the mutation, and the final 
number of resistant bacteria contributed by that one ancestor is 2 g ~ % . 
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Given M separate experiments, in each of which a colony of size N is 
created, and where the measured numbers of resistant bacteria are {n m }^f =1 , 
what can we infer about the mutation rate, a? 

Make the inference given the following dataset from Luria and Delbruck, 
for TV = 2.4 x 10 8 : {n m } = {1, 0, 3, 0, 0, 5, 0, 5, 0, 6, 107, 0, 0, 0, 1, 0, 0, 64, 0, 35}. 
[A small amount of computation is required to solve this problem.] 



► 35.3 Inferring causation 

Exercise 35.5.^' p ' 450 l In the Bayesian graphical model community, the task 
«3sV of inferring which way the arrows point - that is, which nodes are parents, 
and which children - is one on which much has been written. 

Inferring causation is tricky because of 'likelihood equivalence'. Two graph- 
ical models are likelihood-equivalent if for any setting of the parameters of 
either, there exists a setting of the parameters of the others such that the two 
joint probability distributions of all observables are identical. An example of 
a pair of likelihood-equivalent models are A — > B and B — > A. The model 
A — > B asserts that A is the parent of B, or, in very sloppy terminology, 'A 
causes B\ An example of a situation where '£> — > A' is true is the case where 
B is the variable 'burglar in house' and A is the variable 'alarm is ringing'. 
Here it is literally true that B causes A. But this choice of words is confusing if 
applied to another example, R — > D, where R denotes 'it rained this morning' 
and D denotes 'the pavement is dry'. L R causes D' is confusing. I'll therefore 
use the words l B is a parent of A' to denote causation. Some statistical meth- 
ods that use the likelihood alone are unable to use data to distinguish between 
likelihood-equivalent models. In a Bayesian approach, on the other hand, two 
likelihood-equivalent models may nevertheless be somewhat distinguished, in 
the light of data, since likelihood-equivalence does not force a Bayesian to use 
priors that assign equivalent densities over the two parameter spaces of the 
models. 

However, many Bayesian graphical modelling folks, perhaps out of sym- 
pathy for their non-Bayesian colleagues, or from a latent urge not to appear 
different from them, deliberately discard this potential advantage of Bayesian 
methods - the ability to infer causation from data - by skewing their models 
so that the ability goes away; a widespread orthodoxy holds that one should 
identify the choices of prior for which 'prior equivalence' holds, i.e., the priors 
such that models that are likelihood-equivalent also have identical posterior 
probabilities, and then one should use one of those priors in inference and 
prediction. This argument motivates the use, as the prior over all probability 
vectors, of specially-constructed Dirichlet distributions. 

In my view it is a philosophical error to use only those priors such that 
causation cannot be inferred. Priors should be set to describe one's assump- 
tions; when this is done, it's likely that interesting inferences about causation 
can be made from data. 

In this exercise, you'll make an example of such an inference. 

Consider the toy problem where A and B are binary variables. The two 
models are Ha^b an d Hb^a- Ha^b asserts that the marginal probabil- 
ity of A comes from a beta distribution with parameters (1, 1), i.e., the uni- 
form distribution; and that the two conditional distributions P(b\a = 0) and 
P(b\ a = l) also come independently from beta distributions with parameters 
(1,1). The other model assigns similar priors to the marginal probability of 
B and the conditional distributions of A given B. Data are gathered, and the 
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counts, given F = 1000 outcomes, are 



a = a= 1 
: 760 5 
: 1 190 45 



235 



950 50 



What are the posterior probabilities of the two hypotheses? 

Hint: it's a good idea to work this exercise out symbolically in order to spot 
all the simplifications that emerge. 

= lnr(z) ~ \n(x) - ± + 0(l/x 2 ). (35.5) 

The topic of inferring causation is a complex one. The fact that Bayesian 
inference can sensibly be used to infer the directions of arrows in graphs seems 
to be a neglected view, but it is certainly not the whole story. See Pearl (2000) 
for discussion of many other aspects of causality. 



► 35.4 Further exercises 

Exercise 35.6. ^ Photons arriving at a photon detector are believed to be emit- 
ted as a Poisson process with a time-varying rate, 

X(t) =exp(a + &sin(cji + </>)), (35.6) 

where the parameters a, b, w, and <f> are known. Data are collected during 
the time t = . . . T. Given that N photons arrived at times {t n }^ =i , 
discuss the inference of a, b, lu, and <j>. [Further reading: Gregory and 
Loredo (1992).] 

> Exercise 35.7. ^ A data file consisting of two columns of numbers has been 
printed in such a way that the boundaries between the columns are 
unclear. Here are the resulting strings. 



891.10.0 


912.20.0 


874.10.0 


870.20.0 


836.10.0 


861.20.0 


903.10.0 


937.10.0 


850.20.0 


916.20.0 


899.10.0 


907.10.0 


924.20.0 


861.10.0 


899.20.0 


849.10.0 


887.20.0 


840.10.0 


849.20.0 


891.10.0 


916.20.0 


891.10.0 


912.20.0 


875.10.0 


898.20.0 


924.10.0 


950.20.0 


958.10.0 


971.20.0 


933.10.0 


966.20.0 


908.10.0 


924.20.0 


983.10.0 


924.20.0 


908.10.0 


950.20.0 


911.10.0 


913.20.0 


921.25.0 


912.20.0 


917.30.0 


923.50.0 













Discuss how probable it is, given these data, that the correct parsing of 
each item is: 

(a) 891.10.0 -> 891. 10.0, etc. 

(b) 891.10.0 -> 891.1 0.0, etc. 

[A parsing of a string is a grammatical interpretation of the string. For 
example, 'Punch bores' could be parsed as 'Punch (noun) bores (verb)', 
or 'Punch (imperative verb) bores (plural noun)'.] 

> Exercise 35.8. ^ In an experiment, the measured quantities {x n } come inde- 
pendently from a biexponential distribution with mean ji, 



P{x\v) = exp(- \x - n\) , 
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where Z is the normalizing constant, Z - 
An example of this distribution, with fi - 

Assuming the four datapoints are 



2. The mean /j, is not known. 
1, is shown in figure 35.2. 



\X n } = {0, 0.9, 2, 6j, Q1 2345678 

what do these data tell us about n? Include detailed sketches in your 
answer. Give a range of plausible values of /i. 



► 35.5 Solutions 

Solution to exercise 35.4 (p. 446). A population of size N has N opportunities 
to mutate. The probability of the number of mutations that occurred, r, is 
roughly Poisson 

P(r\a,N) = e- aNi -^. (35.7) 

(This is slightly inaccurate because the descendants of a mutant cannot them- 
selves undergo the same mutation.) Each mutation gives rise to a number of 
final mutant cells n, that depends on the generation time of the mutation. If 
multiplication went like clockwork then the probability of n, being 1 would 
be 1/2, the probability of 2 would be 1/4, the probability of 4 would be 1/8, 
and P(rii) = l/(2n) for all ni that are powers of two. But we don't expect 
the mutant progeny to divide in exact synchrony, and we don't know the pre- 
cise timing of the end of the experiment compared to the division times. A 
smoothed version of this distribution that permits all integers to occur is 



P{ni) 



1 1 

Zrf> 



(35.8) 



where Z = 7r 2 /6 = 1.645. [This distribution's moments are all wrong, since 
n, can never exceed N, but who cares about moments? - only sampling 
theory statisticians who are barking up the wrong tree, constructing 'unbiased 
estimators' such as a = (n/N) / log N. The error that we introduce in the 
likelihood function by using the approximation to P(rij) is negligible.] 
The observed number of mutants n is the sum 



Figure 35.2. The biexponcntial 
distribution P(x\ fi = 1). 



(35.9) 



The probability distribution of n given r is the convolution of r identical 
distributions of the form (35.8). For example, 



71-1 



P(n|r = 2)= 



1 1 



1 



' Z 2 n\ (n — ni ) 5 



for n > 2. 



(35.10) 



The probability distribution of n given a, which is what we need for the 
Bayesian inference, is given by summing over r. 



N 



P(n \a)=J2 P ( n \ r ) P ( r I a > N )- 



(35.11) 



r=0 



This quantity can't be evaluated analytically, but for small a, it's easy to 
evaluate to any desired numerical precision by explicitly summing over r from 
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r = to some r max , with P(n \ r) also being found for each r by r max explicit 



convolutions for all required values of n; if r E 



Wmaxj the largest value 



of n encountered in the data, then P(n \ a) is computed exactly; but for this 
question's data, r max = 9 is plenty for an accurate result; I used r max = 
74 to make the graphs in figure 35.3. Octave source code is available. 1 
Incidentally, for data sets like the one in this exercise, which have a substantial 
number of zero counts, very little is lost by making Luria and Delbruck's second 
approximation, which is to retain only the count of how many n were equal to 
zero, and how many were non-zero. The likelihood function found using this 
weakened data set, 

(35.12) 



L(a) = {e- aN ) n {l-e- aN ) 



-aN\9 



is scarcely distinguishable from the likelihood computed using full information. 



Solution to exercise 35.5 (p. 447). From the six terms of the form 



P(F | am) 



r(£i*i + «) n i r(am i )' 



most factors cancel and all that remains is 

P{H a ^b I Data) (765 + 1)(235 



1) 



P{H b ^a I Data) (950 + 1)(50 + 1) 



3.8 



(35.13) 



(35.14) 



There is modest evidence in favour of Ti-A^B because the three probabilities 
inferred for that hypothesis (roughly 0.95, 0.8, and 0.1) are more typical of 
the prior than are the three probabilities inferred for the other (0.24, 0.008, 
and 0.19). This statement sounds absurd if we think of the priors as 'uniform' 
over the three probabilities - surely, under a uniform prior, any settings of the 
probabilities are equally probable? But in the natural basis, the logit basis, 
the prior is proportional to p(l — p), and the posterior probability ratio can 
be estimated by 



0.95 x 0.05 x 0.8 x 0.2 x 0.1 x 0.9 _ 3 
0.24 x 0.76 x 0.008 x 0.992 x 0.19 x 0.81 ~ I' 



(35.15) 



which is not exactly right, but it does illustrate where the preference for A 
is coming from. 



B 




Figure 35.3. Likelihood of the 
mutation rate a on a linear scale 
and log scale, given Luria and 
Delbruck's data. Vertical axis: 
likclihood/10~ 23 ; horizontal axis: 



1 www. inference .phy . cam. ac .uk/ itprnn/code/octave/luriaO .m 
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36 



Decision Theory 



Decision theory is trivial, apart from computational details (just like playing 
chess!). 

You have a choice of various actions, a. The world may be in one of many 
states x; which one occurs may be influenced by your action. The world's 
state has a probability distribution P(x | a). Finally, there is a utility function 
C/(x, a) which specifies the payoff you receive when the world is in state x and 
you chose action a. 

The task of decision theory is to select the action that maximizes the 
expected utility, 



That's all. The computational problem is to maximize £[[/|a] over a. [Pes- 
simists may prefer to define a loss function L instead of a utility function U 
and minimize the expected loss.] 

Is there anything more to be said about decision theory? 

Well, in a real problem, the choice of an appropriate utility function may 
be quite difficult. Furthermore, when a sequence of actions is to be taken, 
with each action providing information about x, we have to take into account 
the effect that this anticipated information may have on our subsequent ac- 
tions. The resulting mixture of forward probability and inverse probability 
computations in a decision problem is distinctive. In a realistic problem such 
as playing a board game, the tree of possible cogitations and actions that must 
be considered becomes enormous, and 'doing the right thing' is not simple, 
because the expected utility of an action cannot be computed exactly (Russell 
and Wefald, 1991; Baum and Smith, 1993; Baum and Smith, 1997). 

Let's explore an example. 

► 36.1 Rational prospecting 

Suppose you have the task of choosing the site for a Tanzanite mine. Your 
final action will be to select the site from a list of N sites. The nth site has 
a net value called the return x n which is initially unknown, and will be found 
out exactly only after site n has been chosen. [x n equals the revenue earned 
from selling the Tanzanite from that site, minus the costs of buying the site, 
paying the staff, and so forth.] At the outset, the return x n has a probability 
distribution P(x n ), based on the information already available. 

Before you take your final action you have the opportunity to do some 
prospecting. Prospecting at the nth site has a cost c„ and yields data d n 
which reduce the uncertainty about x n . [We'll assume that the returns of 




(36.1) 
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the N sites are unrelated to each other, and that prospecting at one site only 
yields information about that site and doesn't affect the return from that site.] 
Your decision problem is: 

given the initial probability distributions P(x\), P(x2), • • • , P(xn), 
first, decide whether to prospect, and at which sites; then, in the 
light of your prospecting results, choose which site to mine. 

For simplicity, let's make everything in the problem Gaussian and focus 
on the question of whether to prospect once or not. We'll assume our utility 
function is linear in x n ; we wish to maximize our expected return. The utility 
function is 

U = x na , (36.2) 

if no prospecting is done, where n a is the chosen 'action' site; and, if prospect- 
ing is done, the utility is 



(36.3) 



where n p is the site at which prospecting took place. 
The prior distribution of the return of site n is 



(36.4) 



(36.5) 



P(x n ) = Normal(a;„; a 2 n ). 
If you prospect at site n, the datum d n is a noisy version of x n : 
P{d n | x n ) = Normal(d„; x n , a 2 ). 

> Exercise 36. 1.^ Given these assumptions, show that the prior probability dis- 
tribution of d n is 

P{d n ) = Normal^; /j, n , a 2 + a 2 n ) (36.6) 

(mnemonic: when independent variables add, variances add), and that 
the posterior distribution of x n given d n is 



where 



P{x„ \d n ) = Normal (x n ; fjf n , o 2 ^ 
d n /a 2 + fi n /a 2 



l/* 2 + l/a 2 



and 



1 



(36.7) 
(36.8) 



(mnemonic: when Gaussians multiply, precisions add). 



To start with let's evaluate the expected utility if we do no prospecting (i.e., 
choose the site immediately); then we'll evaluate the expected utility if we first 
prospect at one site and then make our choice. From these two results we will 
be able to decide whether to prospect once or zero times, and, if we prospect 
once, at which site. 

So, first we consider the expected utility without any prospecting. 

Exercise 36.2. ^ Show that the optimal action, assuming no prospecting, is to 
select the site with biggest mean 



n a = argmax fi n , 

n 

and the expected utility of this action is 

£[U | optimal n] = max jji n . 



(36.9) 



(36.10) 



[If your intuition says 'surely the optimal decision should take into ac- 
count the different uncertainties a n too?', the answer to this question is 
'reasonable - if so, then the utility function should be nonlinear in x\] 



The notation 

P(y) = Normal(t/; fi, a 2 ) indicates 
that y has Gaussian distribution 
with mean /i and variance a 2 . 
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Now the exciting bit. Should we prospect? Once we have prospected at 
site rip, we will choose the site using the decision rule (36.9) with the value of 
mean fi np replaced by the updated value fi' n given by (36.8). What makes the 
problem exciting is that we don't yet know the value of d n , so we don't know 
what our action n a will be; indeed the whole value of doing the prospecting 
comes from the fact that the outcome d n may alter the action from the one 
that we would have taken in the absence of the experimental information. 

From the expression for the new mean in terms of d n (36.8), and the known 
variance of d n (36.6), we can compute the probability distribution of the key 
quantity, fi' n , and can work out the expected utility by integrating over all 
possible outcomes and their associated actions. 

Exercise 36.3.^ Show that the probability distribution of the new mean fi' n 
(36.8) is Gaussian with mean fi n and variance 



r2 ■ 



(36.11) 



Consider prospecting at site n. Let the biggest mean of the other sites be 
Mi- When we obtain the new value of the mean, ii' n , we will choose site n and 
get an expected return of u! n if fi' n > fii, and we will choose site 1 and get an 
expected return of fi\ if fi' n < fi\. 

So the expected utility of prospecting at site n, then picking the best site, 

is 



S 2 ). 



/»oo 

£[U | prospect at n] = —c n + P(fi' n < Mi) Mi + / d(i' n fi' n Normal(^'„; fi n , 

(36.12) 

The difference in utility between prospecting and not prospecting is the 
quantity of interest, and it depends on what we would have done without 
prospecting; and that depends on whether fi± is bigger than fi n . 



£[U | no prospecting] 



f -Mi i 
1 ~Mn i 



if /ii > fl n 
if m < fl n . 



(36.13) 



So 



£[U | prospect at n] — £[U | no prospecting] 

/•oo 

-c„ + / dfi'n {fi'n ~ Ml) Normal(/i^; fi n , s 2 ) if fi\ > fi n 

J Ul / 



(36.14) 



r 1 2 
-Cn + / dfi' n (fn - fi' n ) Normal(^'„; fi n , s ) if A*i < fi n . 

We can plot the change in expected utility due to prospecting (omitting 
Cn) as a function of the difference (fi n — (horizontal axis) and the initial 
standard deviation a n (vertical axis). In the figure the noise variance is a 2 = 1. 




Figure 36.1. Contour plot of the 
gain in expected utility due to 
prospecting. The contours are 
equally spaced from 0.1 to 1.2 in 
steps of 0.1. To decide whether it 
is worth prospecting at site n, find 
the contour equal to c„ (the cost 
of prospecting) ; all points 
[(fi n — Mi)i &n] above that contour 
are worthwhile. 



► 36.2 Further reading 

If the world in which we act is a little more complicated than the prospecting 
problem - for example, if multiple iterations of prospecting are possible, and 
the cost of prospecting is uncertain - then finding the optimal balance between 
exploration and exploitation becomes a much harder computational problem. 
Reinforcement learning addresses approximate methods for this problem (Sut- 
ton and Barto, 1998). 
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► 36.3 Further exercises 

> Exercise 36.4. ^ The four doors problem. 

A new game show uses rules similar to those of the three doors (exer- 
cise 3.8 (p. 57)), but there are four doors, and the host explains: 'First 
you will point to one of the doors, and then I will open one of the other 
doors, guaranteeing to choose a non-winner. Then you decide whether 
to stick with your original pick or switch to one of the remaining doors. 
Then I will open another non- winner (but never the current pick). You 
will then make your final decision by sticking with the door picked on 
the previous decision or by switching to the only other remaining door.' 

What is the optimal strategy? Should you switch on the first opportu- 
nity? Should you switch on the second opportunity? 

> Exercise 36.5. ^ One of the challenges of decision theory is figuring out ex- 

actly what the utility function is. The utility of money, for example, is 
notoriously nonlinear for most people. 

In fact, the behaviour of many people cannot be captured by a coher- 
ent utility function, as illustrated by the Allias paradox, which runs as 
follows. 



Which of these choices do you find most attractive? 

A. £1 million guaranteed. 

B. 89% chance of £1 million; 
10% chance of £2.5 million; 
1% chance of nothing. 

Now consider these choices: 

C. 89% chance of nothing; 
11% chance of £1 million. 

D. 90% chance of nothing; 
10% chance of £2.5 million. 

Many people prefer A to B, and, at the same time, D to C. Prove 
that these preferences are inconsistent with any utility function U(x) 
for money. 

Exercise 36.6. ^ Optimal stopping. 

A large queue of N potential partners is waiting at your door, all asking 
to marry you. They have arrived in random order. As you meet each 
partner, you have to decide on the spot, based on the information so 
far, whether to marry them or say no. Each potential partner has a 
desirability d n , which you find out if and when you meet them. You 
must marry one of them, but you are not allowed to go back to anyone 
you have said no to. 

There are several ways to define the precise problem. 

(a) Assuming your aim is to maximize the desirability d n , i.e., your 
utility function is d n , where h is the partner selected, what strategy 
should you use? 

(b) Assuming you wish very much to marry the most desirable person 
(i.e., your utility function is 1 if you achieve that, and zero other- 
wise); what strategy should you use? 
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(c) Assuming you wish very much to marry the most desirable person, 
and that your strategy will be 'strategy M': 

Strategy M - Meet the first M partners and say no to all 
of them. Memorize the maximum desirability d max among 
them. Then meet the others in sequence, waiting until a 
partner with d n > <i max comes along, and marry them. 
If none more desirable comes along, marry the final iVth 
partner (and feel miserable). 

- what is the optimal value of M? 

Exercise 36.7. ^ Regret as an objective function? 

The preceding exercise (parts b and c) involved a utility function based 
on regret. If one married the tenth most desirable candidate, the utility 
function asserts that one would feel regret for having not chosen the 
most desirable. 

Many people working in learning theory and decision theory use 'mini- 
mizing the maximal possible regret' as an objective function, but does 
this make sense? 

Imagine that Fred has bought a lottery ticket, and offers to sell it to you 
before it's known whether the ticket is a winner. For simplicity say the 
probability that the ticket is a winner is 1/100, and if it is a winner, it 
is worth £10. Fred offers to sell you the ticket for £1. Do you buy it? 

The possible actions are 'buy' and 'don't buy'. The utilities of the four 
possible action-outcome pairs are shown in table 36.2. I have assumed 
that the utility of small amounts of money for you is linear. If you don't 
buy the ticket then the utility is zero regardless of whether the ticket 
proves to be a winner. If you do buy the ticket you either end up losing 
one pound (with probability 99/100) or gaining nine (with probability 
1/100). In the minimax regret community, actions are chosen to mini- 
mize the maximum possible regret. The four possible regret outcomes 
are shown in table 36.3. If you buy the ticket and it doesn't win, you 
have a regret of £1, because if you had not bought it you would have 
been £1 better off. If you do not buy the ticket and it wins, you have 
a regret of £9, because if you had bought it you would have been £9 
better off. The action that minimizes the maximum possible regret is 
thus to buy the ticket. 

Discuss whether this use of regret to choose actions can be philosophi- 
cally justified. 

The above problem can be turned into an investment portfolio decision 
problem by imagining that you have been given one pound to invest in 
two possible funds for one day: Fred's lottery fund, and the cash fund. If 
you put £fi into Fred's lottery fund, Fred promises to return £9f\ to you 
if the lottery ticket is a winner, and otherwise nothing. The remaining 
£fo (with /rj = 1 — /i) is kept as cash. What is the best investment? 
Show that the minimax regret community will invest /i = 9/10 of their 
money in the high risk, high return lottery fund, and only /o = 1/10 in 
cash. Can this investment method be justified? 

Exercise 36.8. Gambling oddities (from Cover and Thomas (1991)). A horse 
race involving / horses occurs repeatedly, and you are obliged to bet 
all your money each time. Your bet at time t can be represented by 



Outcome 



Action 
Buy Don't 
buy 



No win 
Wins 



-1 
+9 



Table 36.2. Utility in the lottery 
ticket problem. 



Outcome 



Action 
Buy Don't 
buy 



No win 1 
Wins 



Table 36.3. Regret in the lottery 
ticket problem. 
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a normalized probability vector b multiplied by your money m(t). The 
odds offered by the bookies are such that if horse i wins then your return 
is m(t+l) = biOiin(t). Assuming the bookies' odds are 'fair', that is, 



and assuming that the probability that horse i wins is pi, work out the 
optimal betting strategy if your aim is Cover's aim, namely, to maximize 
the expected value of log m{T). Show that the optimal strategy sets b 
equal to p, independent of the bookies' odds o. Show that when this 
strategy is used, the money is expected to grow exponentially as: 



where W = Y^iPi lo S 

If you only bet once, is the optimal strategy any different? 

Do you think this optimal strategy makes sense? Do you think that it's 
'optimal', in common language, to ignore the bookies' odds? What can 
you conclude about 'Cover's aim' ? 

Exercise 36.9. ^ Two ordinary dice are thrown repeatedly; the outcome of 
each throw is the sum of the two numbers. Joe Shark, who says that 6 
and 8 are his lucky numbers, bets even money that a 6 will be thrown 
before the first 7 is thrown. If you were a gambler, would you take the 
bet? What is your probability of winning? Joe then bets even money 
that an 8 will be thrown before the first 7 is thrown. Would you take 
the bet? 

Having gained your confidence, Joe suggests combining the two bets into 
a single bet: he bets a larger sum, still at even odds, that an 8 and a 
6 will be thrown before two 7s have been thrown. Would you take the 
bet? What is your probability of winning? 




(36.15) 



(36.16) 
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Bayesian Inference and Sampling Theory 



There are two schools of statistics. Sampling theorists concentrate on having 
methods guaranteed to work most of the time, given minimal assumptions. 
Bayesians try to make inferences that take into account all available informa- 
tion and answer the question of interest given the particular data set. As you 
have probably gathered, I strongly recommend the use of Bayesian methods. 

Sampling theory is the widely used approach to statistics, and most pa- 
pers in most journals report their experiments using quantities like confidence 
intervals, significance levels, and p- values. A p- value (e.g. p = 0.05) is the prob- 
ability, given a null hypothesis for the probability distribution of the data, that 
the outcome would be as extreme as, or more extreme than, the observed out- 
come. Untrained readers - and perhaps, more worryingly, the authors of many 
papers - usually interpret such a p- value as if it is a Bayesian probability (for 
example, the posterior probability of the null hypothesis), an interpretation 
that both sampling theorists and Bayesians would agree is incorrect. 

In this chapter we study a couple of simple inference problems in order to 
compare these two approaches to statistics. 

While in some cases, the answers from a Bayesian approach and from sam- 
pling theory are very similar, we can also find cases where there are significant 
differences. We have already seen such an example in exercise 3.15 (p. 59), 
where a sampling theorist got a p-value smaller than 7%, and viewed this as 
strong evidence against the null hypothesis, whereas the data actually favoured 
the null hypothesis over the simplest alternative. On p. 64, another example 
was given where the p- value was smaller than the mystical value of 5%, yet the 
data again favoured the null hypothesis. Thus in some cases, sampling theory 
can be trigger-happy, declaring results to be 'sufficiently improbable that the 
null hypothesis should be rejected', when those results actually weakly sup- 
port the null hypothesis. As we will now see, there are also inference problems 
where sampling theory fails to detect 'significant' evidence where a Bayesian 
approach and everyday intuition agree that the evidence is strong. Most telling 
of all are the inference problems where the 'significance' assigned by sampling 
theory changes depending on irrelevant factors concerned with the design of 
the experiment. 

This chapter is only provided for those readers who are curious about the 
sampling theory / Bayesian methods debate. If you find any of this chapter 
tough to understand, please skip it. There is no point trying to understand 
the debate. Just use Bayesian methods - they are much easier to understand 
than the debate itself! 
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► 37.1 A medical example 

We are trying to reduce the incidence of an unpleasant disease 
called microsoftus. Two vaccinations, A and B, are tested on 
a group of volunteers. Vaccination B is a control treatment, a 
placebo treatment with no active ingredients. Of the 40 subjects, 
30 are randomly assigned to have treatment A and the other 10 
are given the control treatment B. We observe the subjects for one 
year after their vaccinations. Of the 30 in group A, one contracts 
microsoftus. Of the 10 in group B, three contract microsoftus. 

Is treatment A better than treatment Bl 
Sampling theory has a go 

The standard sampling theory approach to the question 'is A better than B?' 
is to construct a statistical test. The test usually compares a hypothesis such 
as 

Ji\: 'A and B have different effectivenesses' 

with a null hypothesis such as 

7^0 : 'A and B have exactly the same effectivenesses as each other'. 

A novice might object 'no, no, I want to compare the hypothesis "A is better 
than B" with the alternative "B is better than A"!' but such objections are 
not welcome in sampling theory. 

Once the two hypotheses have been defined, the first hypothesis is scarcely 
mentioned again - attention focuses solely on the null hypothesis. It makes me 
laugh to write this, but it's true! The null hypothesis is accepted or rejected 
purely on the basis of how unexpected the data were to 7^o, not on how much 
better Tii predicted the data. One chooses a statistic which measures how 
much a data set deviates from the null hypothesis. In the example here, the 
standard statistic to use would be one called \ 2 (chi-squared). To compute 
X 2 , we take the difference between each data measurement and its expected 
value assuming the null hypothesis to be true, and divide the square of that 
difference by the variance of the measurement, assuming the null hypothesis to 
be true. In the present problem, the four data measurements are the integers 
F A+ , F a _, Fb+, and F B _, that is, the number of subjects given treatment A 
who contracted microsoftus (Fa+), the number of subjects given treatment A 
who didn't (Fa-), and so forth. The definition of x 2 i s: 



x 2 = ^(^-<^>r 



(37.1) 



Actually, in my elementary statistics book (Spiegel, 1988) I find Yates's cor- 
rection is recommended: 



E 



(|F,-(^)|-0.5) 2 
(Fi) 



(37.2) 



In this case, given the null hypothesis that treatments A and B are equally 
effective, and have rates /+ and /_ for the two outcomes, the expected counts 
are: 

(F A+ )=f+N A (F A -)=f-N A 
(F B+ )=f+N B (F B -)=f-N B . 



(37.3) 



If you want to know about Yates's 
correction, read a sampling theory 
textbook. The point of this 
chapter is not to teach sampling 
theory; I merely mention Yates's 
correction because it is what a 
professional sampling theorist 
might use. 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 



37.1: A medical example 



459 



The test accepts or rejects the null hypothesis on the basis of how big \ 2 is. 
To make this test precise, and give it a 'significance level', we have to work 
out what the sampling distribution of x 2 is, taking into account the fact that 
the four data points are not independent (they satisfy the two constraints 
F A+ + F A _ = N A and F B + + F B _ = N B ) and the fact that the parameters 
f± are not known. These three constraints reduce the number of degrees 
of freedom in the data from four to one. [If you want to learn more about 
computing the 'number of degrees of freedom', read a sampling theory book; in 
Bayesian methods we don't need to know all that, and quantities equivalent to 
the number of degrees of freedom pop straight out of a Bayesian analysis when 
they are appropriate.] These sampling distributions are tabulated by sampling 
theory gnomes and come accompanied by warnings about the conditions under 
which they are accurate. For example, standard tabulated distributions for x 2 
are only accurate if the expected numbers Fi are about 5 or more. 

Once the data arrive, sampling theorists estimate the unknown parameters 
f± of the null hypothesis from the data: 



A 



F A+ + Fb+ 
N A + N B ' 



A 



F A - + F B - 
N A + N B ' 



(37.4) 



and evaluate x 2 ■ At this point, the sampling theory school divides itself into 
two camps. One camp uses the following protocol: first, before looking at the 
data, pick the significance level of the test (e.g. 5%), and determine the critical 
value of x 2 above which the null hypothesis will be rejected. (The significance 
level is the fraction of times that the statistic x 2 would exceed the critical 
value, if the null hypothesis were true.) Then evaluate x 2 , compare with the 
critical value, and declare the outcome of the test, and its significance level 
(which was fixed beforehand). 

The second camp looks at the data, finds x 2 , then looks in the table of 
X 2 -distributions for the significance level, p, for which the observed value of x 2 
would be the critical value. The result of the test is then reported by giving 
this value of p, which is the fraction of times that a result as extreme as the one 
observed, or more extreme, would be expected to arise if the null hypothesis 
were true. 

Let's apply these two methods. First camp: let's pick 5% as our signifi- 
cance level. The critical value for x 2 with one degree of freedom is Xo 05 = 3.84. 
The estimated values of f± are 



The sampling distribution of a 
statistic is the probability 
distribution of its value under 
repetitions of the experiment, 
assuming that the null hypothesis 
is true. 



/+ = 1/10, /_ = 9/10. 



(37.5) 



The expected values of the four measurements are 



(Fa+) 
(Fa-) 
(Fb+) 
(Fb-) 



3 

27 
1 

9 



(37.6) 
(37.7) 
(37.8) 
(37.9) 



and x 2 ( as defined in equation (37.1)) is 

X 2 = 5.93. 



(37.10) 



Since this value exceeds 3.84, we reject the null hypothesis that the two treat- 
ments are equivalent at the 0.05 significance level. However, if we use Yates's 
correction, we find x 2 = 3.33, and therefore accept the null hypothesis. 
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Camp two runs a finger across the % 2 table found at the back of any good 
sampling theory book and finds \\o = 2.71. Interpolating between x\o an d 
X 2 05 , camp two reports 'the p- value is p = 0.07'. 

Notice that this answer does not say how much more effective A is than B, 
it simply says that A is 'significantly' different from B. And here, 'significant' 
means only 'statistically significant', not practically significant. 

The man in the street, reading the statement that 'the treatment was sig- 
nificantly different from the control (p = 0.07)', might come to the conclusion 
that 'there is a 93% chance that the treatments differ in effectiveness'. But 
what L p = 0.07' actually means is 'if you did this experiment many times, and 
the two treatments had equal effectiveness, then 7% of the time you would 
find a value of \ 2 more extreme than the one that happened here'. This has 
almost nothing to do with what we want to know, which is how likely it is 
that treatment A is better than B. 

Let me through, I'm a Bayesian 

OK, now let's infer what we really want to know. We scrap the hypothesis 
that the two treatments have exactly equal effectivenesses, since we do not 
believe it. There are two unknown parameters, pa+ and pb+, which are the 
probabilities that people given treatments A and B, respectively, contract the 
disease. 

Given the data, we can infer these two probabilities, and we can answer 
questions of interest by examining the posterior distribution. 
The posterior distribution is 

P(PA + , PB+ | {Fi}) = WmPM^fii^ll . (37.11) 

The likelihood function is 

P({F i }\p A+ ,p B+ ) = (^)pK-(^)pM- (37.12) 



^y A+P T^)p 3 B + P 7 B- (37.13) 



What prior distribution should we use? The prior distribution gives us the 
opportunity to include knowledge from other experiments, or a prior belief 
that the two parameters pa+ and Pb+, while different from each other, are 
expected to have similar values. 

Here we will use the simplest vanilla prior distribution, a uniform distri- 
bution over each parameter. 

P(PA+,PB+) = 1. (37.14) 

We can now plot the posterior distribution. Given the assumption of a sepa- 
rable prior on p A + and pb+, the posterior distribution is also separable: 

P{p A+ ,PB+\{Fi}) = P(p A+ \F A+ ,F A -)P(pB + \F B+ ,F B -). (37.15) 

The two posterior distributions are shown in figure 37.1 (except the graphs 
are not normalized) and the joint posterior probability is shown in figure 37.2. 

If we want to know the answer to the question 'how probable is it that p A + 
is smaller than p B +l\ we can answer exactly that question by computing the 
posterior probability 

P(PA+ <p B+ |Data), (37.16) 
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0.2 



0.4 



0.6 



0.8 



Figure 37.1. Posterior 
probabilities of the two 
effectivenesses. Treatment A 
solid line; B - dotted line. 




Figure 37.2. Joint posterior 
probability of the two 
effectivenesses - contour plot and 
surface plot. 



which is the integral of the joint posterior probability P{pa+,Pb+ I Data) 
shown in figure 37.2 over the region in which pa+ < Pb+, i-e., the shaded 
triangle in figure 37.3. The value of this integral (obtained by a straightfor- 
ward numerical integration of the likelihood function (37.13) over the relevant 
region) is P(pa+ <Pb+ I Data) = 0.990. 

Thus there is a 99% chance, given the data and our prior assumptions, 
that treatment A is superior to treatment B. In conclusion, according to our 
Bayesian model, the data (1 out of 30 contracted the disease after vaccination 
A, and 3 out of 10 contracted the disease after vaccination B) give very strong 
evidence - about 99 to one - that treatment A is superior to treatment B. 

In the Bayesian approach, it is also easy to answer other relevant questions. 
For example, if we want to know 'how likely is it that treatment A is ten times 
more effective than treatment B?', we can integrate the joint posterior proba- 
bility P(pa+,Pb+ | Data) over the region in which pa+ < Wpb+ (figure 37.4). 




PA+ 



Figure 37.3. The proposition 
PA+ < Pb+ is true for all points in 
the shaded triangle. To find the 
probability of this proposition wc 
integrate the joint posterior 
probability P(pa+,Pb+ | Data) 
(figure 37.2) over this region. 



Model comparison 

If there were a situation in which we really did want to compare the two 
hypotheses Ho: pa+ = Pb+ and Hi: pa+ Pb+, we can of course do this 
directly with Bayesian methods also. 
As an example, consider the data set: 

D: One subject, given treatment A, subsequently contracted microsoftus. 
One subject, given treatment B, did not. 



Treatment A B 

Got disease 1 
Did not 1 




PA+ 



Figure 37.4. The proposition 

PA+ < Wpb+ is true for all points 

in the shaded triangle. 



Total treated 1 1 
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How strongly does this data set favour Hi over TLqI 

We answer this question by computing the evidence for each hypothesis. 
Let's assume uniform priors over the unknown parameters of the models. The 
first hypothesis 7io- Pa+ = Pb+ has just one unknown parameter, let's call it 
V- 

P(p\n ) = l PG(0,1). (37.17) 

We'll use the uniform prior over the two parameters of model 7i\ that we used 
before: 

P(PA+,PB+\Hi) = 1 p A+ £ (0,1), p B+ £ (0,1). (37.18) 

Now, the probability of the data D under model 7irj is the normalizing constant 
from the inference of p given D: 

P(D\H ) = JdpP(D\p)P(p\H ) (37.19) 

= Jdpp(l-p)xl (37.20) 
= 1/6. (37.21) 

The probability of the data D under model Hi is given by a simple two- 
dimensional integral: 

P(D\U l ) = J Jdp A+ dp B+ P(D\p A+ ,p B+ )P( PA+ ,p B+ \n 1 ) (37.22) 

= j dp A +PA+ Jdp B+ (l-p B +) (37.23) 

= 1/2 x 1/2 (37.24) 
= 1/4. (37.25) 

Thus the evidence ratio in favour of model Hi, which asserts that the two 
effectivenesses are unequal, is 

P(D\H 1 ) 1/4 0.6 

P(D\H ) ~ 1/6 ~ 0.4' {6{ - Zb) 

So if the prior probability over the two hypotheses was 50:50, the posterior 
probability is 60:40 in favour of Hi. □ 
Is it not easy to get sensible answers to well-posed questions using Bayesian 
methods? 

[The sampling theory answer to this question would involve the identical 
significance test that was used in the preceding problem; that test would yield 
a 'not significant' result. I think it is greatly preferable to acknowledge what 
is obvious to the intuition, namely that the data D do give weak evidence in 
favour of H\. Bayesian methods quantify how weak the evidence is.] 

► 37.2 Dependence of p-values on irrelevant information 

In an expensive laboratory, Dr. Bloggs tosses a coin labelled a and b twelve 
times and the outcome is the string 

aaabaaaabaab, 



which contains three 6s and nine as. 

What evidence do these data give that the coin is biased in favour of a? 
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37.2: Dependence ofp-values on irrelevant information 

Dr. Bloggs consults his sampling theory friend who says 'let r be the num- 
ber of 6s and n = 12 be the total number of tosses; I view r as the random 
variable and find the probability of r taking on the value r = 3 or a more 
extreme value, assuming the null hypothesis p a = 0.5 to be true'. He thus 
computes 

P(r<3|n = 12, Wo ) = £ h V2" = (ft) + (\ 2 ) + ft) + ft)) 

r=0 ^ ' 

= 0.07, (37.27) 

and reports 'at the significance level of 5%, there is not significant evidence 
of bias in favour of a'. Or, if the friend prefers to report p- values rather than 
simply compare p with 5%, he would report 'the p-value is 7%, which is not 
conventionally viewed as significantly small'. If a two-tailed test seemed more 
appropriate, he might compute the two-tailed area, which is twice the above 
probability, and report 'the p-value is 15%, which is not significantly small'. 
We won't focus on the issue of the choice between the one-tailed and two-tailed 
tests, as we have bigger fish to catch. 

Dr. Bloggs pays careful attention to the calculation (37.27), and responds 
'no, no, the random variable in the experiment was not r: I decided before 
running the experiment that I would keep tossing the coin until I saw three 
6s; the random variable is thus n\ 

Such experimental designs arc not unusual. In my experiments on error- 
correcting codes I often simulate the decoding of a code until a chosen number 
r of block errors (&s) has occurred, since the error on the inferred value of logpb 
goes roughly as ^/r, independent of n. 

Exercise 37. 1.^ Find the Bayesian inference about the bias p a of the coin 
•M- given the data, and determine whether a Bayesian's inferences depend 

on what stopping rule was in force. 

According to sampling theory, a different calculation is required in order 
to assess the 'significance' of the result n = 12. The probability distribution 
of n given Ho is the probability that the first n— 1 tosses contain exactly r— 1 
6s and then the nth toss is a 6. 

P(n\H ,r)= ("lJ)V2 n . (37.28) 

The sampling theorist thus computes 

P{n> 12 I r = 3, Wo) = 0.03. (37.29) 

He reports back to Dr. Bloggs, 'the p- value is 3% - there is significant evidence 
of bias after all!' 

What do you think Dr. Bloggs should do? Should he publish the result, 
with this marvellous p-value, in one of the journals that insists that all exper- 
imental results have their 'significance' assessed using sampling theory? Or 
should he boot the sampling theorist out of the door and seek a coherent 
method of assessing significance, one that does not depend on the stopping 
rule? 

At this point the audience divides in two. Half the audience intuitively 
feel that the stopping rule is irrelevant, and don't need any convincing that 
the answer to exercise 37.1 (p. 463) is 'the inferences about p a do not depend 
on the stopping rule'. The other half, perhaps on account of a thorough 
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training in sampling theory, intuitively feel that Dr. Bloggs's stopping rule, 
which stopped tossing the moment the third b appeared, may have biased the 
experiment somehow. If you are in the second group, I encourage you to reflect 
on the situation, and hope you'll eventually come round to the view that is 
consistent with the likelihood principle, which is that the stopping rule is not 
relevant to what we have learned about p a . 

As a thought experiment, consider some onlookers who (in order to save 
money) are spying on Dr. Bloggs's experiments: each time he tosses the coin, 
the spies update the values of r and n. The spies are eager to make inferences 
from the data as soon as each new result occurs. Should the spies' beliefs 
about the bias of the coin depend on Dr. Bloggs's intentions regarding the 
continuation of the experiment? 

The fact that the p- values of sampling theory do depend on the stopping 
rule (indeed, whole volumes of the sampling theory literature are concerned 
with the task of assessing 'significance' when a complicated stopping rule is 
required - 'sequential probability ratio tests', for example) seems to me a com- 
pelling argument for having nothing to do with p-values at all. A Bayesian 
solution to this inference problem was given in sections 3.2 and 3.3 and exer- 
cise 3.15 (p. 59). 

Would it help clarify this issue if I added one more scene to the story? 
The janitor, who's been eavesdropping on Dr. Bloggs's conversation, comes in 
and says 'I happened to notice that just after you stopped doing the experi- 
ments on the coin, the Officer for Whimsical Departmental Rules ordered the 
immediate destruction of all such coins. Your coin was therefore destroyed by 
the departmental safety officer. There is no way you could have continued the 
experiment much beyond n = 12 tosses. Seems to me, you need to recompute 
your p- value?' 

► 37.3 Confidence intervals 

In an experiment in which data D are obtained from a system with an unknown 
parameter 8, a standard concept in sampling theory is the idea of a confidence 
interval for 6. Such an interval (# m in(-D), S max (D)) has associated with it a 
confidence level such as 95% which is informally interpreted as 'the probability 
that 8 lies in the confidence interval'. 

Let's make precise what the confidence level really means, then give an 
example. A confidence interval is a function (9 m i n (D),0 max (D)) of the data 
set D. The confidence level of the confidence interval is a property that we can 
compute before the data arrive. We imagine generating many data sets from a 
particular true value of 8, and calculating the interval (O m i n (D), 8 max (D)), and 
then checking whether the true value of 8 lies in that interval. If, averaging 
over all these imagined repetitions of the experiment, the true value of 8 lies 
in the confidence interval a fraction / of the time, and this property holds for 
all true values of 8, then the confidence level of the confidence interval is /. 

For example, if 8 is the mean of a Gaussian distribution which is known 
to have standard deviation 1, and D is a sample from that Gaussian, then 
(6min(D),8 max (D)) = (D-2,D+2) is a 95% confidence interval for 8. 

Let us now look at a simple example where the meaning of the confidence 
level becomes clearer. Let the parameter 8 be an integer, and let the data be 
a pair of points X\,X2, drawn independently from the following distribution: 




for other values of x. 



(37.30) 
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For example, if 9 were 39, then we could expect the following data sets: 

(37.31) 



D = (xi,x 2 ) = (39, 39) with probability l k 

(xi,x 2 ) = (39,40) with probability l k 

{xi,x 2 ) = (40, 39) with probability l k 

{xi,x 2 ) = (40,40) with probability l k. 



We now consider the following confidence interval: 

[0min(-D),0max(-D)] = [min^, x 2 ), min(xi, x 2 )]. (37.32) 
For example, if (xi,x 2 ) = (40, 39), then the confidence interval for 9 would be 

[0minOD),0ma*(£>)] = [39,39]. 

Let's think about this confidence interval. What is its confidence level? 
By considering the four possibilities shown in (37.31), we can see that there 
is a 75% chance that the confidence interval will contain the true value. The 
confidence interval therefore has a confidence level of 75%, by definition. 

Now, what if the data we acquire are (x\,x 2 ) = (29,29)? Well, we can 
compute the confidence interval, and it is [29,29]. So shall we report this 
interval, and its associated confidence level, 75%? This would be correct 
by the rules of sampling theory. But does this make sense? What do we 
actually know in this case? Intuitively, or by Bayes' theorem, it is clear that 9 
could either be 29 or 28, and both possibilities are equally likely (if the prior 
probabilities of 28 and 29 were equal). The posterior probability of 9 is 50% 
on 29 and 50% on 28. 

What if the data are {x\,x 2 ) = (29,30)? In this case, the confidence 
interval is still [29,29], and its associated confidence level is 75%. But in this 
case, by Bayes' theorem, or common sense, we are 100% sure that 9 is 29. 

In neither case is the probability that 9 lies in the '75% confidence interval' 
equal to 75%! 

Thus 

1 . the way in which many people interpret the confidence levels of sampling 
theory is incorrect; 

2. given some data, what people usually want to know (whether they know 
it or not) is a Bayesian posterior probability distribution. 

Are all these examples contrived? Am I making a fuss about nothing? 
If you are sceptical about the dogmatic views I have expressed, I encourage 
you to look at a case study: look in depth at exercise 35.4 (p. 446) and the 
reference (Kepler and Oprea, 2001), in which sampling theory estimates and 
confidence intervals for a mutation rate are constructed. Try both methods 
on simulated data - the Bayesian approach based on simply computing the 
likelihood function, and the confidence interval from sampling theory; and let 
me know if you don't find that the Bayesian answer is always better than the 
sampling theory answer; and often much, much better. This suboptimality 
of sampling theory, achieved with great effort, is why I am passionate about 
Bayesian methods. Bayesian methods are straightforward, and they optimally 
use all the information in the data. 



► 37.4 Some compromise positions 

Let's end on a conciliatory note. Many sampling theorists are pragmatic - 
they are happy to choose from a selection of statistical methods, choosing 
whichever has the 'best' long-run properties. In contrast, I have no problem 
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with the idea that there is only one answer to a well-posed problem; but it's 
not essential to convert sampling theorists to this viewpoint: instead, we can 
offer them Bayesian estimators and Bayesian confidence intervals, and request 
that the sampling theoretical properties of these methods be evaluated. We 
don't need to mention that the methods are derived from a Bayesian per- 
spective. If the sampling properties are good then the pragmatic sampling 
theorist will choose to use the Bayesian methods. It is indeed the case that 
many Bayesian methods have good sampling-theoretical properties. Perhaps 
it's not surprising that a method that gives the optimal answer for each indi- 
vidual case should also be good in the long run! 

Another piece of common ground can be conceded: while I believe that 
most well-posed inference problems have a unique correct answer, which can 
be found by Bayesian methods, not all problems are well-posed. A common 
question arising in data modelling is 'am I using an appropriate model?' Model 
criticism, that is, hunting for defects in a current model, is a task that may 
be aided by sampling theory tests, in which the null hypothesis ('the current 
model is correct') is well defined, but the alternative model is not specified. 
One could use sampling theory measures such as p- values to guide one's search 
for the aspects of the model most in need of scrutiny. 

Further reading 

My favourite reading on this topic includes (Jaynes, 1983; Gull, 1988; Loredo, 
1990; Berger, 1985; Jaynes, 2003). Treatises on Bayesian statistics from the 
statistics community include (Box and Tiao, 1973; O'Hagan, 1994). 

► 37.5 Further exercises 

> Exercise 37.2. [3C] A traffic survey records traffic on two successive days. On 

Friday morning, there are 12 vehicles in one hour. On Saturday morn- 
ing, there are 9 vehicles in half an hour. Assuming that the vehicles are 
Poisson distributed with rates Af and As (in vehicles per hour) respec- 
tively, 

(a) is As greater than \p? 

(b) by what factor is As bigger or smaller than Aj?? 

> Exercise 37.3. Write a program to compare treatments A and B given 

data Fa+, Fa-, Fg + , Fg_ as described in section 37.1. The outputs 
of the program should be (a) the probability that treatment A is more 
effective than treatment B; (b) the probability that pa+ < 10ps+', (c) 
the probability that ps+ < WpA+- 
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Part V 

Neural networks 
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38 



Introduction to Neural Networks 



In the field of neural networks, we study the properties of networks of idealized 
'neurons'. 

Three motivations underlie work in this broad and interdisciplinary field. 

Biology. The task of understanding how the brain works is one of the out- 
standing unsolved problems in science. Some neural network models are 
intended to shed light on the way in which computation and memory 
are performed by brains. 

Engineering. Many researchers would like to create machines that can 'learn', 
perform 'pattern recognition' or 'discover patterns in data'. 

Complex systems. A third motivation for being interested in neural net- 
works is that they are complex adaptive systems whose properties are 
interesting in their own right. 

I should emphasize several points at the outset. 

• This book gives only a taste of this field. There are many interesting 
neural network models which we will not have time to touch on. 

• The models that we discuss are not intended to be faithful models of 
biological systems. If they are at all relevant to biology, their relevance 
is on an abstract level. 

• I will describe some neural network methods that are widely used in 
nonlinear data modelling, but I will not be able to give a full description 
of the state of the art. If you wish to solve real problems with neural 
networks, please read the relevant papers. 

► 38.1 Memories 

In the next few chapters we will meet several neural network models which 
come with simple learning algorithms which make them function as memories. 
Perhaps we should dwell for a moment on the conventional idea of memory 
in digital computation. A memory (a string of 5000 bits describing the name 
of a person and an image of their face, say) is stored in a digital computer 
at an address. To retrieve the memory you need to know the address. The 
address has nothing to do with the memory itself. Notice the properties that 
this scheme does not have: 

1. Address-based memory is not associative. Imagine you know half of a 
memory, say someone's face, and you would like to recall the rest of the 
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memory - their name. If your memory is address-based then you can't 
get at a memory without knowing the address. [Computer scientists have 
devoted effort to wrapping traditional address-based memories inside 
cunning software to produce content-addressable memories, but content- 
addressability does not come naturally. It has to be added on.] 

2. Address-based memory is not robust or fault-tolerant. If a one- bit mis- 
take is made in specifying the address then a completely different mem- 
ory will be retrieved. If one bit of a memory is flipped then whenever 
that memory is retrieved the error will be present. Of course, in all mod- 
ern computers, error-correcting codes are used in the memory, so that 
small numbers of errors can be detected and corrected. But this error- 
tolerance is not an intrinsic property of the memory system. If minor 
damage occurs to certain hardware that implements memory retrieval, 
it is likely that all functionality will be catastrophically lost. 

3. Address-based memory is not distributed. In a serial computer that 
is accessing a particular memory, only a tiny fraction of the devices 
participate in the memory recall: the CPU and the circuits that are 
storing the required byte. All the other millions of devices in the machine 
are sitting idle. 

Are there models of truly parallel computation, in which multiple de- 
vices participate in all computations? [Present-day parallel computers 
scarcely differ from serial computers from this point of view. Memory 
retrieval works in just the same way, and control of the computation 
process resides in CPUs. There are simply a few more CPUs. Most of 
the devices sit idle most of the time.] 

Biological memory systems are completely different. 

1. Biological memory is associative. Memory recall is content-addressable. 
Given a person's name, we can often recall their face; and vice versa. 
Memories are apparently recalled spontaneously, not just at the request 
of some CPU. 

2. Biological memory recall is error-tolerant and robust. 

• Errors in the cues for memory recall can be corrected. An example 
asks you to recall 'An American politician who was very intelligent 
and whose politician father did not like broccoli'. Many people 
think of president Bush - even though one of the cues contains an 
error. 

• Hardware faults can also be tolerated. Our brains are noisy lumps 
of meat that are in a continual state of change, with cells being 
damaged by natural processes, alcohol, and boxing. While the cells 
in our brains and the proteins in our cells are continually changing, 
many of our memories persist unaffected. 

3. Biological memory is parallel and distributed - not completely distributed 
throughout the whole brain: there does appear to be some functional 
specialization - but in the parts of the brain where memories are stored, 
it seems that many neurons participate in the storage of multiple mem- 
ories. 



These properties of biological memory systems motivate the study of 'arti- 
ficial neural networks' - parallel distributed computational systems consisting 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 

470 38 — Introduction to Neural Networks 



of many interacting simple elements. The hope is that these model systems 
might give some hints as to how neural computation is achieved in real bio- 
logical neural networks. 

► 38.2 Terminology 

Each time we describe a neural network algorithm we will typically specify 
three things. [If any of this terminology is hard to understand, it's probably 
best to dive straight into the next chapter.] 

Architecture. The architecture specifies what variables are involved in the 
network and their topological relationships - for example, the variables 
involved in a neural net might be the weights of the connections between 
the neurons, along with the activities of the neurons. 

Activity rule. Most neural network models have short time-scale dynamics: 
local rules define how the activities of the neurons change in response 
to each other. Typically the activity rule depends on the weights (the 
parameters) in the network. 

Learning rule. The learning rule specifies the way in which the neural net- 
work's weights change with time. This learning is usually viewed as 
taking place on a longer time scale than the time scale of the dynamics 
under the activity rule. Usually the learning rule will depend on the 
activities of the neurons. It may also depend on the values of target 
values supplied by a teacher and on the current value of the weights. 

Where do these rules come from? Often, activity rules and learning rules are 
invented by imaginative researchers. Alternatively, activity rules and learning 
rules may be derived from carefully chosen objective functions. 

Neural network algorithms can be roughly divided into two classes. 

Supervised neural networks are given data in the form of inputs and tar- 
gets, the targets being a teacher's specification of what the neural net- 
work's response to the input should be. 

Unsupervised neural networks are given data in an undivided form - sim- 
ply a set of examples {x}. Some learning algorithms are intended simply 
to memorize these data in such a way that the examples can be recalled 
in the future. Other algorithms are intended to 'generalize', to discover 
'patterns' in the data, or extract the underlying 'features' from them. 

Some unsupervised algorithms are able to make predictions - for exam- 
ple, some algorithms can 'fill in' missing variables in an example x - and 
so can also be viewed as supervised networks. 
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The Single Neuron cLS cL Classifier 



► 39.1 The single neuron 

We will study a single neuron for two reasons. First, many neural network 
models are built out of single neurons, so it is good to understand them in 
detail. And second, a single neuron is itself capable of 'learning' - indeed, 
various standard statistical methods can be viewed in terms of single neurons 
- so this model will serve as a first example of a supervised neural network. 

Definition of a single neuron 

We will start by defining the architecture and the activity rule of a single 
neuron, and we will then derive a learning rule. 

Architecture. A single neuron has a number / of inputs Xi and one output 
which we will here call y. (See figure 39.1.) Associated with each input 
is a weight Wj (i = I,..., I). There may be an additional parameter 
wo of the neuron called a bias which we may view as being the weight 
associated with an input xq which is permanently set to 1. The single 
neuron is a feedforward device - the connections are directed from the 
inputs to the output of the neuron. 

Activity rule. The activity rule has two steps. 

1. First, in response to the imposed inputs x, we compute the activa- 
tion of the neuron, 

a = ^2wiXi, (39.1) 




WW \ \Wl 

Xi ■ ■ ■ Xj 

Figure 39.1. A single neuron 



where the sum is over i = 0, . . . , / if there is a bias and i = 
otherwise. 

2. Second, the output y is set as a function f(a) of the activation. 
The output is also called the activity of the neuron, not to be 
confused with the activation a. There are several possible activation 
functions; here are the most popular. 

(a) Deterministic activation functions: 

i. Linear. 

y(o) = a. (39.2) 

ii. Sigmoid (logistic function). 

1 



y(a) 



(ye (o,i)). 



(39.3) 



activation 
a 



activity 

y(a) 



471 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 



472 



39 — The Single Neuron as a Classifier 



iii. Sigmoid (tanh). 

y{o) = tanh(o) (ye (-1,1)). 

iv. Threshold function. 

1 a > 



y(a) = 0(a) = 



-1 a < 0. 



(39.4) 



(39.5) 



(b) Stochastic activation functions: y is stochastically selected from 
±1. 

i. Heat bath. 



1 with probability 
— 1 otherwise. 



1 + e~ a 



(39.6) 



ii. The Metropolis rule produces the output in a way that 
depends on the previous output state y: 

Compute A = ay 

If A < 0, flip y to the other state 

Else flip y to the other state with probability e~ A . 



► 39.2 Basic neural network concepts 

A neural network implements a function y(x;w); the 'output' of the network, 
y, is a nonlinear function of the 'inputs' x; this function is parameterized by 
'weights' w. 

We will study a single neuron which produces an output between and 1 
as the following function of x: 



y(x;w) 



1 



l + e- 



(39.7) 



Exercise 39.1.^ In what contexts have we encountered the function y(x; w) = 
1/(1 + e- w x ) already? 



Motivations for the linear logistic function 

In section 11.2 we studied 'the best detection of pulses', assuming that one 
of two signals xrj and xi had been transmitted over a Gaussian channel with 
variance-covariance matrix A -1 . We found that the probability that the source 
signal was s = 1 rather than s = 0, given the received signal y, was 



P(s = l\y) = 



1 + exp(-o(y)) ' 
where a(y) was a linear function of the received vector, 

a(y) = w T y + 6, 



(39.8) 



(39.9) 



with w = A(xi — xo). 

The linear logistic function can be motivated in several other ways - see 
the exercises. 
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Figure 39.2. Output of a simple 
neural network as a function of its 
input. 



w= (0,2) 



Input space and weight space 

For convenience let us study the case where the input vector x and the param- 
eter vector w are both two-dimensional: x = (xi,X2), w = {w\,W2). Then we 
can spell out the function performed by the neuron thus: 

w(x;w) = — ■ r. (39.10) 

v ' \ _|_ e — (W1X1+W2X2) v ' 

Figure 39.2 shows the output of the neuron as a function of the input vector, 
for w = (0, 2). The two horizontal axes of this figure are the inputs x\ and X2, 
with the output y on the vertical axis. Notice that on any line perpendicular 
to w, the output is constant; and along a line in the direction of w, the output 
is a sigmoid function. 

We now introduce the idea of weight space, that is, the parameter space of 
the network. In this case, there are two parameters u>\ and W2, so the weight 
space is two dimensional. This weight space is shown in figure 39.3. For a 
selection of values of the parameter vector w, smaller inset figures show the 
function of x performed by the network when w is set to those values. Each of 
these smaller figures is equivalent to figure 39.2. Thus each point in w space 
corresponds to a function of x. Notice that the gain of the sigmoid function 
(the gradient of the ramp) increases as the magnitude of w increases. 

Now, the central idea of supervised neural networks is this. Given examples 
of a relationship between an input vector x, and a target t, we hope to make 
the neural network 'learn' a model of the relationship between x and t. A 
successfully trained network will, for any given x, give an output y that is 
close (in some sense) to the target value t. Training the network involves 
searching in the weight space of the network for a value of w that produces a 
function that fits the provided training data well. 

Typically an objective function or error function is defined, as a function 
of w, to measure how well the network with weights set to w solves the task. 
The objective function is a sum of terms, one for each input/target pair {x, t}, 
measuring how close the output y(x; w) is to the target t. The training process 
is an exercise in function minimization - i.e., adjusting w in such a way as to 
find a w that minimizes the objective function. Many function-minimization 
algorithms make use not only of the objective function, but also its gradient 
with respect to the parameters w. For general feedforward neural networks 
the backpropagation algorithm efficiently evaluates the gradient of the output 
y with respect to the parameters w, and thence the gradient of the objective 
function with respect to w. 
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Figure 39.3. Weight space. 
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► 39.3 Training the single neuron as a binary classifier 

We assume we have a data set of inputs {xW}JL] with binary labels {t^}^ =1 , 
and a neuron whose output j/(x; w) is bounded between and 1. We can then 
write down the following error function: 

G(w) = -J2 [i (n) lny(x (n) ;w) + (1 - *( n >)ln(l - y(xW;w))] . (39.11) 

n 

Each term in this objective function may be recognized as the information 
content of one outcome. It may also be described as the relative entropy be- 
tween the empirical probability distribution (p n ' , 1 — v- n ') and the probability 
distribution implied by the output of the neuron (y, 1 — y). The objective func- 
tion is bounded below by zero and only attains this value if y(x( n );w) = t^ 
for all n. 

We now differentiate this objective function with respect to w. 

4! it Exercise 39.2. ^ The backpropagation algorithm. Show that the derivative g = 
dG/dw is given by: 

ir N 

» = £ L = E-(* (n) -» (n) ) a: ? ,) - ( 39 - 12 ) 

^ n=l 

Notice that the quantity e^ n ' = t^ — y^ n > is the error on example n - the 
difference between the target and the output. The simplest thing to do with 
a gradient of an error function is to descend it (even though this is often di- 
mensionally incorrect, since a gradient has dimensions [1/parameter], whereas 
a change in a parameter has dimensions [parameter]). Since the derivative 
dG/dw is a sum of terms g(") defined by 

g V) = _(*(») - y (v)) x M (39.13) 

for n = 1, . . . , N, we can obtain a simple on-line algorithm by putting each 
input through the network one at a time, and adjusting w a little in a direction 
opposite to g("). 

We summarize the whole learning algorithm. 

The on-line gradient- descent learning algorithm 

Architecture. A single neuron has a number / of inputs Xi and one output 
y. Associated with each input is a weight Wi (i = 1, . . . , I). 

Activity rule. 1. First, in response to the received inputs x (which may be 
arbitrary real numbers), we compute the activation of the neuron, 

a = ^2wiXi, (39.14) 

i 

where the sum is over i = 0, . . . , I if there is a bias and i = 
otherwise. 

2. Second, the output y is set as a sigmoid function of the activation. 



This output might be viewed as stating the probability, according to the 
neuron, that the given input is in class 1 rather than class 0. 
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Learning rule. The teacher supplies a target value t € {0, 1} which says 
what the correct answer is for the given input. We compute the error 
signal 

e = t-y (39.16) 

then adjust the weights w in a direction that would reduce the magnitude 
of this error: 

Awi = r/exi, (39.17) 

where n is the 'learning rate'. Commonly n is set by trial and error to a 
constant value or to a decreasing function of simulation time r such as 
Vo/r. 

The activity rule and learning rule are repeated for each input/target pair 
(x, t) that is presented. If there is a fixed data set of size N, we can cycle 
through the data multiple times. 

Batch learning versus on-line learning 

Here we have described the on-line learning algorithm, in which a change in 
the weights is made after every example is presented. An alternative paradigm 
is to go through a hatch of examples, computing the outputs and errors and 
accumulating the changes specified in equation (39.17) which are then made 
at the end of the batch. 



Batch learning for the single neuron classifier 



For each input /target pair (x™,tW) (n = 1,.. 
y( n ) = y(x( n );w), where 


. , N) , compute 


y(x ' w) " l + exp(-X>iSi)' 


(39.18) 


define = t^ — y( n \ and compute for each weij 


dit Wi 


flj B > = -eW,W. 


(39.19) 


Then let 

Awi = -n^gf'. 

n 


(39.20) 


This batch learning algorithm is a gradient descent algorithm, whereas the 
on-line algorithm is a stochastic gradient descent algorithm. Source code 
implementing batch learning is given in algorithm 39.5. This algorithm is 
demonstrated in figure 39.4 for a neuron with two inputs with weights wi and 
u>2 and a bias wq, performing the function 


2/( X ! W ) l e -(w +W 1 X 1 +W 2 X2) ' 


(39.21) 



The bias Wq is included, in contrast to figure 39.3, where it was omitted. The 
neuron is trained on a data set of ten labelled examples. 
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Figure 39.4. A single neuron learning to classify by gradient descent. The neuron has two weights W\ 
and w 2 and a bias w . The learning rate was set to rj = 0.01 and batch-mode gradient 
descent was performed using the code displayed in algorithm 39.5. (a) The training data. 

(b) Evolution of weights wq, w\ and w 2 as a function of number of iterations (on log scale). 

(c) Evolution of weights wi and u>2 in weight space, (d) The objective function G(w) as a 
function of number of iterations, (e) The magnitude of the weights E\y{w) as a function of 
time, (f-k) The function performed by the neuron (shown by three of its contours) after 30, 
80, 500, 3000, 10 000 and 40 000 iterations. The contours shown are those corresponding to 
a = 0, ±1, namely y = 0.5,0.27 and 0.73. Also shown is a vector proportional to (w\,W2)- 
The larger the weights are, the bigger this vector becomes, and the closer together are the 
contours. 
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39 — The Single Neuron as a Classifier 



global x; #xisanN*I matrix containing all the input vectors 


global t ; # t is a vector of 


length N containing all the targets 


for 1 = 1:L # loop L times 




a = x * w ; 


# compute all activations 


y = sigmoid(a) ; 


# compute outputs 


e = t - y ; 


# compute errors 


g = - x' * e ; 


# compute the gradient vector 


w = w - eta * ( g + alpha * w ) 


# make step, using learning rate eta 




# and weight decay alpha 


endf or 




function f = sigmoid ( v ) 




f = 1.0./(1.0.+ exp(-v)) 




endf unction 





Algorithm 39.5. Octave source 
code for a gradient descent 
optimizer of a single neuron, 
batch learning, with optional 
weight decay (rate alpha). 
Octave notation: the instruction 
a = x * w causes the (N x 7) 
matrix x consisting of all the 
input vectors to be multiplied by 
the weight vector w, giving the 
vector a listing the activations for 
all N input vectors; x' means 
x-transpose; the single command 
y = sigmoid (a) computes the 
sigmoid function of all elements of 
the vector a. 
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Figure 39.6. The influence of 
weight decay on a single neuron's 
learning. The objective function is 
M(w) = G(w) + aE w (w). The 
learning method was as in 
figure 39.4. (a) Evolution of 
weights wq, w\ and w 2 . (b) 
Evolution of weights t«i and u>2 in 
weight space shown by points, 
contrasted with the trajectory 
followed in the case of zero weight 
decay, shown by a thin line (from 
figure 39.4). Notice that for this 
problem weight decay has an 
effect very similar to 'early 
stopping', (c) The objective 
function M(w) and the error 
function G(w) as a function of 
number of iterations, (d) The 
function performed by the neuron 
after 40 000 iterations. 
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39.4: Beyond descent on the error function: regularization 479 

► 39.4 Beyond descent on the error function: regularization 

If the parameter r\ is set to an appropriate value, this algorithm works: the 
algorithm finds a setting of w that correctly classifies as many of the examples 
as possible. 

If the examples are in fact linearly separable then the neuron finds this lin- 
ear separation and its weights diverge to ever-larger values as the simulation 
continues. This can be seen happening in figure 39.4(f-k). This is an exam- 
ple of overfitting, where a model fits the data so well that its generalization 
performance is likely to be adversely affected. 

This behaviour may be viewed as undesirable. How can it be rectified? 

An ad hoc solution to overfitting is to use early stopping, that is, use 
an algorithm originally intended to minimize the error function G(w), then 
prevent it from doing so by halting the algorithm at some point. 

A more principled solution to overfitting makes use of regularization. Reg- 
ularization involves modifying the objective function in such a way as to in- 
corporate a bias against the sorts of solution w which we dislike. In the above 
example, what we dislike is the development of a very sharp decision bound- 
ary in figure 39.4k; this sharp boundary is associated with large weight values, 
so we use a regularizer that penalizes large weight values. We modify the 
objective function to: 

M(w) = G(w) + aE w {w) (39.22) 
where the simplest choice of regularizer is the weight decay regularizer 

E w (w) = ^ w l (39-23) 

i 

The regularization constant a is called the weight decay rate. This additional 
term favours small values of w and decreases the tendency of a model to overfit 
fine details of the training data. The quantity a is known as a hyperparameter. 
Hyperparameters play a role in the learning algorithm but play no role in the 
activity rule of the network. 

■ ' Exercise 39.3.^ Compute the derivative of M(w) with respect to Wj. Why is 
45v the above regularizer known as the 'weight decay' regularizer? 

The gradient descent source code of algorithm 39.5 implements weight decay. 
This gradient descent algorithm is demonstrated in figure 39.6 using weight 
decay rates a = 0.01, 0.1, and 1. As the weight decay rate is increased 
the solution becomes biased towards broader sigmoid functions with decision 
boundaries that are closer to the origin. 



Note 

Gradient descent with a step size rj is in general not the most efficient way to 
minimize a function. A modification of gradient descent known as momentum, 
while improving convergence, is also not recommended. Most neural network 
experts use more advanced optimizers such as conjugate gradient algorithms. 
[Please do not confuse momentum, which is sometimes given the symbol a, 
with weight decay.] 
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► 39.5 Further exercises 

More motivations for the linear neuron 

> Exercise 39.4. ^ Consider the task of recognizing which of two Gaussian distri- 
butions a vector z comes from. Unlike the case studied in section 11.2, 
where the distributions had different means but a common variance- 
covariance matrix, we will assume that the two distributions have ex- 
actly the same mean but different variances. Let the probability of z 
given s (s G {0, 1}) be 



P(z\s) = ]J Normal^; 0,<t^), 



(39.24) 



t=i 



where a 2 si is the variance of z% when the source symbol is s. Show that 
P(s = 1 | z) can be written in the form 



P(s = l|z) 



1 



1 + exp(-w T x + 6) 
where xi is an appropriate function of z%, x% = g(zi). 

Exercise 39.5. ^ The noisy LED. 

c(2) 



(39.25) 



c(3) 



-j c(8) 



□ 



Consider an LED display with 7 elements numbered as shown above. The 
state of the display is a vector x. When the controller wants the display 
to show character number s, e.g. s = 2, each element Xj (j = I, ... ,7) 
either adopts its intended state Cj(s), with probability 1 — /, or is flipped, 
with probability /. Let's call the two states of x '+V and ' — 1'. 

(a) Assuming that the intended character s is actually a 2 or a 3, what 
is the probability of s, given the state x? Show that P(s = 2|x) 
can be written in the form 



P(s = 2|x) 



(39.26) 



1 + exp(-w T x + 9) ' 

and compute the values of the weights w in the case / = 0.1. 

(b) Assuming that s is one of {0, 1,2,..., 9}, with prior probabilities 
p s , what is the probability of s, given the state x? Put your answer 
in the form 



P(s|x) 



E e<v ' 



(39.27) 



where {a s } are functions of {cj(s)} and x. 



Could you make a better alphabet of 10 characters for a noisy LED, i.e., 
an alphabet less susceptible to confusion? 

t> bxercise 39.6. [2] A (3,1) error-correcting code consists of the two codewords 
xW = (1, 0, 0) and x< 2 > = (0, 0, 1). A source bit s £ {1, 2} having proba- 
bility distribution {pi,P2} is used to select one of the two codewords for 
transmission over a binary symmetric channel with noise level /. The 
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Table 39.7. An alternative 
15-character alphabet for the 
7-element LED display. 
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39.5: Further exercises 

received vector is r. Show that the posterior probability of s given r can 
be written in the form 

P(s = 1|r) = 7 ^ V 

1 + exp(^-w - J2n=l w nr n ) 

and give expressions for the coefficients {w n }^=i and the bias, wo. 

Describe, with a diagram, how this optimal decoder can be expressed in 
terms of a 'neuron'. 
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Problems to look at before Chapter 40 



> Exercise 40.1. [2] What is J2k=o (k) ? 

nnmoti nn _ ... 

\(N-K) 



[The symbol (£) means the combination ...... ' .... .1 



> Exercise 40.2. If the top row of Pascal's triangle (which contains the single 

number '1') is denoted row zero, what is the sum of all the numbers in 
the triangle above row TV? 

> Exercise 40.3. ^ 3 points are selected at random on the surface of a sphere. 

What is the probability that all of them lie on a single hemisphere? 

This chapter's material is originally due to Polya (1954) and Cover (1965) and 
the exposition that follows is Yaser Abu-Mostafa's. 
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Capacity of a Single Neuron 



N 



{t n r n 



Learning 
algorithm 



w 



N 



W 



Figure 40.1. Neural network 
learning viewed as 
communication. 



► 40.1 Neural network learning as communication 

Many neural network models involve the adaptation of a set of weights w in 
response to a set of data points, for example a set of N target values Djy = 
{tn}n=i at given locations {x„}^ =1 . The adapted weights are then used to 
process subsequent input data. This process can be viewed as a communication 
process, in which the sender examines the data Dn and creates a message w 
that depends on those data. The receiver then uses w; for example, the 
receiver might use the weights to try to reconstruct what the data D N was. 
[In neural network parlance, this is using the neuron for 'memory' rather than 
for 'generalization'; 'generalizing' means extrapolating from the observed data 
to the value of t^+i at some new location xjv+i-] Just as a disk drive is a 
communication channel, the adapted network weights w therefore play the 
role of a communication channel, conveying information about the training 
data to a future user of that neural net. The question we now address is, 
'what is the capacity of this channel?' - that is, 'how much information can 
be stored by training a neural network?' 

If we had a learning algorithm that either produces a network whose re- 
sponse to all inputs is +1 or a network whose response to all inputs is 0, 
depending on the training data, then the weights allow us to distinguish be- 
tween just two sorts of data set. The maximum information such a learning 
algorithm could convey about the data is therefore 1 bit, this information con- 
tent being achieved if the two sorts of data set are equiprobable. How much 
more information can be conveyed if we make full use of a neural network's 
ability to represent other functions? 

► 40.2 The capacity of a single neuron 

We will look at the simplest case, that of a single binary threshold neuron. We 
will find that the capacity of such a neuron is two bits per weight. A neuron 
with K inputs can store 2K bits of information. 

To obtain this interesting result we lay down some rules to exclude less 
interesting answers, such as: 'the capacity of a neuron is infinite, because each 
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40 — Capacity of a Single Neuron 



of its weights is a real number and so can convey an infinite number of bits' . 
We exclude this answer by saying that the receiver is not able to examine the 
weights directly, nor is the receiver allowed to probe the weights by observing 
the output of the neuron for arbitrarily chosen inputs. We constrain the 
receiver to observe the output of the neuron at the same fixed set of N points 
{x„} that were in the training set. What matters now is how many different 
distinguishable functions our neuron can produce, given that we can only 
observe the function at these N points. How many different binary labellings 
of N points can a linear threshold function produce? And how does this 
number compare with the maximum possible number of binary labellings, 
2 N 1 If nearly all of the 2 N labellings can be realized by our neuron, then it is 
a communication channel that can convey all N bits (the target values {t n }) 
with small probability of error. We will identify the capacity of the neuron as 
the maximum value that N can have such that the probability of error is very 
small. [We are departing a little from the definition of capacity in Chapter 9.] 
We thus examine the following scenario. The sender is given a neuron 
with K inputs and a data set which is a labelling of N points. The 
sender uses an adaptive algorithm to try to find a w that can reproduce this 
labelling exactly. We will assume the algorithm finds such a w if it exists. The 
receiver then evaluates the threshold function on the N input values. What 
is the probability that all N bits are correctly reproduced? How large can N 
become, for a given K, without this probability becoming substantially less 
than one? 

General position 

One technical detail needs to be pinned down: what set of inputs {x„} are we 
considering? Our answer might depend on this choice. We will assume that 
the points are in general position. 

Definition 40.1 A set of points {x„} in K -dimensional space are in general 
position if any subset of size < K is linearly independent, and no K + 1 of 
them lie in a (K — I) -dimensional plane. 

In K = 3 dimensions, for example, a set of points are in general position if no 
three points are colinear and no four points are coplanar. The intuitive idea is 
that points in general position are like random points in the space, in terms of 
the linear dependences between points. You don't expect three random points 
in three dimensions to lie on a straight line. 

The linear threshold function 

The neuron we will consider performs the function 



We will not have a bias wq; the capacity for a neuron with a bias can be 
obtained by replacing K by K + 1 in the final result below, i.e., considering 
one of the inputs to be fixed to 1. (These input points would not then be in 
general position; the derivation still works.) 




(40.1) 



where 




(40.2) 
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Figure 40.2. One data point in a 
two-dimensional input space, and 
the two regions of weight space 
that give the two alternative 
labellings of that point. 



► 40.3 Counting threshold functions 

Let us denote by T(N, K) the number of distinct threshold functions on N 
points in general position in K dimensions. We will derive a formula for 
T(N,K). 

To start with, let us work out a few cases by hand. 
In K = 1 dimension, for any N 

The N points lie on a line. By changing the sign of the one weight w\ we can 
label all points on the right side of the origin 1 and the others 0, or vice versa. 
Thus there are two distinct threshold functions. T(N, 1) = 2. 

With N = 1 point, for any K 

If there is just one point xW then we can realize both possible labellings by 
setting w = ±xW. Thus T(1,K) = 2. 

In K = 2 dimensions 

In two dimensions with N points, we are free to spin the separating line around 
the origin. Each time the line passes over a point we obtain a new function. 
Once we have spun the line through 360 degrees we reproduce the function 
we started from. Because the points are in general position, the separating 
plane (line) crosses only one point at a time. In one revolution, every point 
is passed over twice. There are therefore 2N distinct threshold functions. 
T{N,2) = 2N. 

Comparing with the total number of binary functions, 2 N , we may note 
that for TV > 3, not all binary functions can be realized by a linear threshold 
function. One famous example of an unrealizable function with N = 4 and 
K = 2 is the exclusive-or function on the points x = (±1,±1). [These points 
are not in general position, but you may confirm that the function remains 
unrealizable even if the points are perturbed into general position.] 

In K = 2 dimensions, from the point of view of weight space 

There is another way of visualizing this problem. Instead of visualizing a 
plane separating points in the two-dimensional input space, we can consider 
the two-dimensional weight space, colouring regions in weight space different 
colours if they label the given datapoints differently. We can then count the 
number of threshold functions by counting how many distinguishable regions 
there are in weight space. Consider first the set of weight vectors in weight 
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Figure 40.3. Two data points in a 
two-dimensional input space, and 
the four regions of weight space 
that give the four alternative 
labellings. 



Figure 40.4. Three data points in 
a two-dimensional input space, 
and the six regions of weight 
space that give alternative 
labellings of those points. In this 
case, the labellings (0, 0, 0) and 
(1,1,1) cannot be realized. For 
any three points in general 
position there are always two 
labellings that cannot be realized. 



space that classify a particular example as a 1. For example, figure 40.2a 
shows a single point in our two-dimensional x-space, and figure 40.2b shows 
the two corresponding sets of points in w-space. One set of weight vectors 
occupy the half space 

x (ll) -w>0, (40.3) 

and the others occupy x^'-w < 0. In figure 40.3a we have added a second 
point in the input space. There are now 4 possible labellings: (1,1), (1,0), 
(0,1), and (0,0). Figure 40.3b shows the two hyperplanes = and 

x ( 2 )-w = which separate the sets of weight vectors that produce each of 
these labellings. When N = 3 (figure 40.4), weight space is divided by three 
hyperplanes into six regions. Not all of the eight conceivable labellings can be 
realized. Thus T(3, 2) = 6. 

In K = 3 dimensions 

We now use this weight space visualization to study the three dimensional 
case. 

Let us imagine adding one point at a time and count the number of thresh- 
old functions as we do so. When N = 2, weight space is divided by two hy- 
perplanes xW-w = and x( 2 '-w = into four regions; in any one region all 
vectors w produce the same function on the 2 input vectors. Thus T(2, 3) = 4. 

Adding a third point in general position produces a third plane in w space, 
so that there are 8 distinguishable regions. T(3, 3) = 8. The three bisecting 
planes are shown in figure 40.5a. 

At this point matters become slightly more tricky. As figure 40.5b illus- 
trates, the fourth plane in the three-dimensional w space cannot transect all 
eight of the sets created by the first three planes. Six of the existing regions 
are cut in two and the remaining two are unaffected. So T(4, 3) = 14. Two 
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Figure 40.5. Weight space 
illustrations for T(3, 3) and 
T(4,3). (a) T(3,3) = 8. Three 
hyperplanes (corresponding to 
three points in general position) 
divide 3-space into 8 regions, 
shown here by colouring the 
relevant part of the surface of a 
hollow, semi-transparent cube 
centred on the origin, (b) 
T(4, 3) = 14. Four hyperplanes 
divide 3-space into 14 regions, of 
which this figure shows 13 (the 
14th region is out of view on the 
right-hand face. Compare with 
figure 40.5a: all of the regions 
that are not coloured white have 
been cut into two. 



Table 40.6. Values of T(N, K) 
deduced by hand. 




2^ 












/ 

s / 













Figure 40.7. Illustration of the 
cutting process going from T(3, 3) 
to T(4, 3). (a) The eight regions 
of figure 40.5a with one added 
hypcrplane. All of the regions 
that are not coloured white have 
been cut into two. (b) Here, the 
hollow cube has been made solid, 
so we can sec which regions are 
cut by the fourth plane. The front 
half of the cube has been cut 
away, (c) This figure shows the 
new two dimensional hypcrplane, 
which is divided into six regions 
by the three one-dimensional 
hyperplanes (lines) which cross it. 
Each of these regions corresponds 
to one of the three-dimensional 
regions in figure 40.7a which is 
cut into two by this new 
hypcrplane. This shows that 
T(4, 3) - T(3, 3) = 6. Figure 40.7c 
should be compared with figure 
40.4b. 
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488 40 — Capacity of a Single Neuron 

of the binary functions on 4 points in 3 dimensions cannot be realized by a 
linear threshold function. 

We have now filled in the values of T(N, K) shown in table 40.6. Can we 
obtain any insights into our derivation of T(4, 3) in order to fill in the rest of 
the table for T(N, K)l Why was T(4, 3) greater than T(3, 3) by six? 

Six is the number of regions that the new hyperplane bisected in w-space 
(figure 40.7ab). Equivalently, if we look in the K—l dimensional subspace 
that is the ATth hyperplane, that subspace is divided into six regions by the 
N—l previous hyperplanes (figure 40.7c). Now this is a concept we have met 
before. Compare figure 40.7c with figure 40.4b. How many regions are created 
by iV — 1 hyperplanes in a K—l dimensional space? Why, T(N—1,K—1), of 
course! In the present case N = 4, K = 3, we can look up T(3, 2) = 6 in the 
previous section. So 

T(4,3) = T(3,3) +T(3,2). (40.4) 
Recurrence relation for any N, K 

Generalizing this picture, we see that when we add an ATth hyperplane in K 
dimensions, it will bisect T(N—1, K—l) of the T(N —1, K) regions that were 
created by the previous N—l hyperplanes. Therefore, the total number of 
regions obtained after adding the Nth hyperplane is 2T(N — 1, K — 1) (since 
T{N —1, K—l) out of T(N— 1, K) regions are split in two) plus the remaining 
T(N—1, K) - T{N-1,K-1) regions not split by the Nth hyperplane, which 
gives the following equation for T(N,K): 

T(N, K) = T(N-1,K)+T(N-1,K-1). (40.5) 

Now all that remains is to solve this recurrence relation given the boundary 
conditions T(N, 1) = 2 and T(1,K) = 2. 

Does the recurrence relation (40.5) look familiar? Maybe you remember 
building Pascal's triangle by adding together two adjacent numbers in one row 
to get the number below. The N, K element of Pascal's triangle is equal to 

c ^ k ^{k) = W^- (4 °- 6) 



Table 40.8. Pascal's triangle. 
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Combinations (^) satisfy the equation 

C(N,K) = C(N-1,K-1) +C(N-1,K), for all N > 0. (40.7) 

[Here we are adopting the convention that (^) = if K > N or K < 0.] 
So (^) satisfies the required recurrence relation (40.5). This doesn't mean 
T(N,K) = (^), since many functions can satisfy one recurrence relation. 
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40.3: Counting threshold functions 




N/K 

Figure 40.9. The fraction of functions on N points in K dimensions that arc linear threshold functions, 
T(N, K)/2 N , shown from various viewpoints. In (a) we sec the dependence on K, which 
is approximately an error function passing through 0.5 at K = N/2; the fraction reaches 1 
at K = N . In (b) we see the dependence on N, which is 1 up to N = K and drops sharply 
at N = 2K. Panel (c) shows the dependence on N/K for K = 1000. There is a sudden 
drop in the fraction of realizable labcllings when N = 2K. Panel (d) shows the values of 
log 2 T(N, K) and log 2 2 N as a function of N for K = 1000. These figures were plotted 
using the approximation of T/2 N by the error function. 



But perhaps we can express T(N, K) as a linear superposition of combination 
functions of the form C a ^{N,K) = (^J^j). By comparing tables 40.8 and 
40.6 we can see how to satisfy the boundary conditions: we simply need to 
translate Pascal's triangle to the right by 1, 2, 3, . . .; superpose; add; multiply 
by two, and drop the whole table by one line. Thus: 

K ~ l //v_i\ 

T(N,K) = 2^ . (40.8) 



fc=0 



Using the fact that the Nth. row of Pascal's triangle sums to 2 N , that is, 
T,k=o C^k 1 ) = 2AL1 > we can simplify t he cases where K-1>N-1. 

( 2 N K > N 



Interpretation 

It is natural to compare T(N, K) with the total number of binary functions on 
N points, 2 N . The ratio T(N, K)/2 tells us the probability that an arbitrary 
labelling {t n }^ =i can be memorized by our neuron. The two functions are 
equal for all N < K. The line N = K is thus a special line, defining the 
maximum number of points on which any arbitrary labelling can be realized. 
This number of points is referred to as the Vapnik-Chervonenkis dimension 
(VC dimension) of the class of functions. The VC dimension of a binary 
threshold function on K dimensions is thus K. 
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40 — Capacity of a Single Neuron 



What is interesting is (for large K) the number of points N such that 
almost any labelling can be realized. The ratio T(N, K)/2 N is, for N < 2K, 
still greater than 1/2, and for large K the ratio is very close to 1. 

For our purposes the sum in equation (40.9) is well approximated by the 
error function, 

(40.10, 



N/2 J 



where $f» = f*^ exp(-z 2 /2)/^. Fi gure 40.9 shows the realizable fraction 
T(N, K)/2 N as a function of N and K. The take-home message is shown in 
figure 40.9c: although the fraction T(N, K)/2 N is less than 1 for N > K, it is 
only negligibly less than 1 up to N = 2K; there, there is a catastrophic drop 
to zero, so that for N > 2K, only a tiny fraction of the binary labellings can 
be realized by the threshold function. 

Conclusion 

The capacity of a linear threshold neuron, for large K , is 2 bits per weight. 

A single neuron can almost certainly memorize up to N = 2K random 
binary labels perfectly, but will almost certainly fail to memorize more. 



► 40.4 Further exercises 

t> Exercise 40.4. ^ Can a finite set of 2N distinct points in a two-dimensional 
space be split in half by a straight line 

• if the points are in general position? 

• if the points are not in general position? 

Can 2N points in a K dimensional space be split in half by a K — 1 
dimensional hyperplane? 

bxercise 

40 5 [2, P-491] Four pointg 

are selected at random on the surface of a 
sphere. What is the probability that all of them lie on a single hemi- 
sphere? How does this question relate to T(N, K)l 

i^ fS Exercise 40.6.^ Consider the binary threshold neuron in K = 3 dimensions, 
and the set of points {x} = {(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)}. Find 
a parameter vector w such that the neuron memorizes the labels: (a) 

{t} = {l,l,l,l}; (b) {t} = {1,1,0,0}. 

Find an unrealizable labelling {t}. 

> Exercise 40.7. ^ In this chapter we constrained all our hyperplanes to go 
through the origin. In this exercise, we remove this constraint. 

How many regions in a plane are created by N lines in general position? 

Exercise 40.8.^ Estimate in bits the total sensory experience that you have 
had in your life - visual information, auditory information, etc. Estimate 
how much information you have memorized. Estimate the information 
content of the works of Shakespeare. Compare these with the capacity of 
your brain assuming you have 10 11 neurons each making 1000 synaptic 
connections, and that the capacity result for one neuron (two bits per 
connection) applies. Is your brain full yet? 




Figure 40.10. Three lines in a 
plane create seven regions. 
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40.5: Solutions 

> Exercise 40.9. ^ What is the capacity of the axon of a spiking neuron, viewed 
as a communication channel, in bits per second? [See MacKay and 
McCulloch (1952) for an early publication on this topic] Multiply by 
the number of axons in the optic nerve (about 10 6 ) or cochlear nerve 
(about 50 000 per ear) to estimate again the rate of acquisition sensory 
experience. 

► 40.5 Solutions 



Solution to exercise 40.5 (p. 490). The probability that all four points lie on a 
single hemisphere is 

T(4,3)/2 4 = 14/16 = 7/8. (40.11) 
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41 



Learning as Inference 



► 41.1 Neural network learning as inference 

In Chapter 39 we trained a simple neural network as a classifier by minimizing 
an objective function 

M(w) = G(w) + aE w (w) (41.1) 
made up of an error function 

G(w) = -J2 [t (n) ln 2 /(xW;w) + (l-tW)ln(l-j/(x(");w))] (41.2) 

n 

and a regularizer 

fjrl'wi : 



£w(w) = i£>?. (41-3) 



This neural network learning process can be given the following probabilistic 
interpretation. 

We interpret the output y(x; w) of the neuron literally as defining (when 
its parameters w are specified) the probability that an input x belongs to class 
t = 1, rather than the alternative t = 0. Thus j/(x; w) = P(t = 1 | x, w). Then 
each value of w defines a different hypothesis about the probability of class 1 
relative to class as a function of x. 

We define the observed data D to be the targets {t} - the inputs {x} are 
assumed to be given, and not to be modelled. To infer w given the data, we 
require a likelihood function and a prior probability over w. The likelihood 
function measures how well the parameters w predict the observed data; it is 
the probability assigned to the observed t values by the model with parameters 
set to w. Now the two equations 

P(i = l|w,x) = y 

P(t = 0|w,x) = 1-y { ' 

can be rewritten as the single equation 

P(t | w, x) = y*(l - y) 1 -* = exp[t In y + (1 - t) ln(l - y)\ . (41.5) 

So the error function G can be interpreted as minus the log likelihood: 

P(D|w) =exp[-G(w)]. (41.6) 

Similarly the regularizer can be interpreted in terms of a log prior proba- 
bility distribution over the parameters: 

^(w | a) = — ^— exp(-aE w ). (41.7) 
Z w (a) 
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41.2: Illustration for a neuron with two weights 493 

If Eyy is quadratic as denned above, then the corresponding prior distribution 
is a Gaussian with variance o\, = 1/cn, and 1/Zw(ct) is equal to (a/2-K) K ' 2 , 
where K is the number of parameters in the vector w. 

The objective function M(w) then corresponds to the inference of the 
parameters w, given the data: 

. . . P(D I w)P(w I a) , t x 

PMD,a) = (41.8) 



(41.9) 



e -G(w) e -«%(w)^ ff(a ) 

P(D\a) 

= -!-exp(-M(w)). (41.10) 

So the w found by (locally) minimizing M(w) can be interpreted as the (locally) 
most probable parameter vector, w*. From now on we will refer to w* as w MP . 

Why is it natural to interpret the error functions as log probabilities? Error 
functions are usually additive. For example, G is a sum of information con- 
tents, and Ew is a sum of squared weights. Probabilities, on the other hand, 
are multiplicative: for independent events X and Y, the joint probability is 
P(x, y) = P(x)P(y). The logarithmic mapping maintains this correspondence. 

The interpretation of M(w) as a log probability has numerous benefits, 
some of which we will discuss in a moment. 



► 41.2 Illustration for a neuron with two weights 

In the case of a neuron with just two inputs and no bias, 

y(x;w) = t— : r, (41.11) 

" y ' ' 1 _|_ e -(WlXl+W2X2) K ' 

we can plot the posterior probability of w, P(w | D, a) oc exp(— M(w)). Imag- 
ine that we receive some data as shown in the left column of figure 41.1. Each 
data point consists of a two-dimensional input vector x and a t value indicated 
by x (t = 1) or □ (t = 0). The likelihood function exp(— G(w)) is shown as a 
function of w in the second column. It is a product of functions of the form 
(41.11). 

The product of traditional learning is a point in w-space, the estimator w*, 
which maximizes the posterior probability density. In contrast, in the Bayesian 
view, the product of learning is an ensemble of plausible parameter values 
(bottom right of figure 41.1). We do not choose one particular hypothesis w; 
rather we evaluate their posterior probabilities. The posterior distribution is 
obtained by multiplying the likelihood by a prior distribution over w space 
(shown as a broad Gaussian at the upper right of figure 41.1). The posterior 
ensemble (within a multiplicative constant) is shown in the third column of 
figure 41.1, and as a contour plot in the fourth column. As the amount of data 
increases (from top to bottom), the posterior ensemble becomes increasingly 
concentrated around the most probable value w*. 

► 41.3 Beyond optimization: making predictions 

Let us consider the task of making predictions with the neuron which we 
trained as a classifier in section 39.3. This was a neuron with two inputs and 
a bias. 

?/(x;w) = -. ; ; r. (41.12) 

V ' \ _|_ e -(WQ+WlXl+W2X2) 
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494 41 — Learning as Inference 



Data set Likelihood Probability of parameters 




X1 



Figure 41.1. The Bayesian interpretation and generalization of traditional neural network learning. 
Evolution of the probability distribution over parameters as data arrive. 




(a) (b) 5 10 (c) 

Figure 41.2. Making predictions, (a) The function performed by an optimized neuron w MP (shown by 
three of its contours) trained with weight decay, a = 0.01 (from figure 39.6). The contours 
shown arc those corresponding to a = 0, ±1, namely y = 0.5, 0.27 and 0.73. (b) Are these 
predictions more reasonable? (Contours shown are for y = 0.5,0.27, 0.73, 0.12 and 0.88.) 
(c) The posterior probability of w (schematic); the Bayesian predictions shown in (b) were 
obtained by averaging together the predictions made by each possible value of the weights 
w, with each value of w receiving a vote proportional to its probability under the posterior 
ensemble. The method used to create (b) is described in section 41.4. 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 



41.3: Beyond optimization: making predictions 



When we last played with it, we trained it by minimizing the objective function 

M(w) = G(w) + aE(w). (41.13) 

The resulting optimized function for the case a = 0.01 is reproduced in fig- 
ure 41.2a. 

We now consider the task of predicting the class t< ,v + 1) corresponding to a 
new input x (,v+1) . It is common practice, when making predictions, simply to 
use a neural network with its weights fixed to their optimized value w MP , but 
this is not optimal, as can be seen intuitively by considering the predictions 
shown in figure 41.2a. Are these reasonable predictions? Consider new data 
arriving at points A and B. The best fit model assigns both of these examples 
probability 0.2 of being in class 1, because they have the same value of 
If we really knew that w was equal to w MP , then these predictions would be 
correct. But we do not know w. The parameters are uncertain. Intuitively we 
might be inclined to assign a less confident probability (closer to 0.5) at B than 
at A, as shown in figure 41.2b, since point B is far from the training data. The 
best fit parameters w MP often give over-confident predictions. A non-Bayesian 
approach to this problem is to downweight all predictions uniformly, by an 
empirically determined factor (Copas, 1983). This is not ideal, since intuition 
suggests the strength of the predictions at B should be downweighted more 
than those at A. A Bayesian viewpoint helps us to understand the cause of 
the problem, and provides a straightforward solution. In a nutshell, we obtain 
Bayesian predictions by taking into account the whole posterior ensemble, 
shown schematically in figure 41.2c. 

The Bayesian prediction of a new datum t (JV+1) involves marginalizing over 
the parameters (and over anything else about which we are uncertain). For 
simplicity, let us assume that the weights w are the only uncertain quantities 
- the weight decay rate a and the model TL itself are assumed to be fixed. 
Then by the sum rule, the predictive probability of a new target t (JV+1) at a 
location x (JV+1) is: 

p (t («+D | x ("+i) )£)) a) = j d if w p (t («+D | x (w ); w> Q)p(w J D; a); (41 w) 

where K is the dimensionality of w, three in the toy problem. Thus the 
predictions are obtained by weighting the prediction for each possible w, 

P(t(«+ 1 ) = l|x("+ 1 ),w,a) = y(x^+ 1 );w) 

J P(t( JV + 1 > = 0|x< JV + 1 >,w,a) = l-f/(x< w ';w), K ' 

with a weight given by the posterior probability of w, P(w \D,a), which we 
most recently wrote down in equation (41.10). This posterior probability is 

P(w \D,a) = — exp(-M(w)), (41.16) 
Zm 

where 

Z M = J d K wexp{-M{w j). (41.17) 

In summary, we can get the Bayesian predictions if we can find a way of 
computing the integral 

P{t (N+1) = l\^ N+1) ,D,a) = /d K w 2 /(x (N+1) ;w)^exp(-M(w)), (41.18) 

J Zm 

which is the average of the output of the neuron at x (JV+1) under the posterior 
distribution of w. 
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41 — Learning as Inference 



Q(x;x< 1 )):; ' 




(a) Ls (b) W (c) 

Dumb Metropolis Gradient descent Langevin 



Figure 41.3. One step of the 
Langevin method in two 
dimensions (c), contrasted with a 
traditional 'dumb' Metropolis 
method (a) and with gradient 
descent (b). The proposal density 
of the Langevin method is given 
by 'gradient descent with noise'. 



Implementation 

How shall we compute the integral (41.18)? For our toy problem, the weight 
space is three dimensional; for a realistic neural network the dimensionality 
K might be in the thousands. 

Bayesian inference for general data modelling problems may be imple- 
mented by exact methods (Chapter 25), by Monte Carlo sampling (Chapter 
29), or by deterministic approximate methods, for example, methods that 
make Gaussian approximations to P(w | D, a) using Laplace's method (Chap- 
ter 27) or variational methods (Chapter 33). For neural networks there are few 
exact methods. The two main approaches to implementing Bayesian inference 
for neural networks are the Monte Carlo methods developed by Neal (1996) 
and the Gaussian approximation methods developed by MacKay (1991). 

► 41.4 Monte Carlo implementation of a single neuron 

First we will use a Monte Carlo approach in which the task of evaluating the 
integral (41.18) is solved by treating y(x (JV+1) ; w) as a function / of w whose 
mean we compute using 

(/(w))4e/( wW ) ( 4i - 19 ) 

r 

where {w^} are samples from the posterior distribution -^-exp(— M(w)) (cf. 
equation (29.6)). We obtain the samples using a Metropolis method (section 
29.4). As an aside, a possible disadvantage of this Monte Carlo approach is 
that it is a poor way of estimating the probability of an improbable event, i.e., 
a P(t | D,7i) that is very close to zero, if the improbable event is most likely 
to occur in conjunction with improbable parameter values. 

How to generate the samples {w^}? Radford Neal introduced the Hamil- 
tonian Monte Carlo method to neural networks. We met this sophisticated 
Metropolis method, which makes use of gradient information, in Chapter 30. 
The method we now demonstrate is a simple version of Hamiltonian Monte 
Carlo called the Langevin Monte Carlo method. 

The Langevin Monte Carlo method 

The Langevin method (algorithm 41.4) may be summarized as 'gradient de- 
scent with added noise', as shown pictorially in figure 41.3. A noise vector p 
is generated from a Gaussian with unit variance. The gradient g is computed, 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 



41.4: Monte Carlo implementation of a single neuron 



497 



g = gradM ( w ) ; 




# 


set gradient using initial w 


M = findM ( w ) ; 




# 


set objective function too 


■f i — ^ • t 

ior -L — 1 . Li 




# 


loop L times 


p = randn ( size(w) ) ; 




# 


initial momentum is Normal (0,1) 


H = p' *p/2 + M; 




# 


evaluate H(w,p) 


* p = p - epsilon * g / 2 ; 




# 


make half-step in p 


* wnew = w + epsilon * p ; 




# 


make step in w 


* gnew = gradM ( wnew ) ; 




# 


find new gradient 


* p = p - epsilon * gnew / 2 ; 




# 


make half-step in p 


Mnew = findM ( wnew ) ; 




# 


find new objective function 


Hnew = p' * p / 2 + Mnew ; 




# 


evaluate new value of H 


dH = Hnew - H ; 




# 


decide whether to accept 


if ( dH < ) 


accept 


= 1 




elseif ( randO < exp(-dH) ) 


accept 


= 1 


; # compare with a uniform 


else 


accept 


= 


; # variate 


endif 








if ( accept ) g = gnew 


; w = 


wnew ; M = Mnew ; endif 


endf or 








function gM = gradM ( w ) 




# 


gradient of objective function 


a = x * w ; 




# 


compute activations 


y = sigmoid(a) ; 




# 


compute outputs 


e = t - y ; 




# 


compute errors 


g = - x' * e ; 




# 


compute the gradient of G(w) 


gM = alpha * w + g ; 








endf unction 








function M = findM ( w ) 




# 


objective function 


G = - (t> * log(y) + (l-f) 


* log( 


i-y )) ; 


EW = w' * w / 2 ; 








M = G + alpha * EW ; 








endf unction 









Algorithm 41.4. Octave source 
code for the Langcvin Monte 
Carlo method. To obtain the 
Hamiltonian Monte Carlo method, 
we repeat the four lines marked * 
multiple times (algorithm 41.8). 
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10000 20000 30000 40000 




30000 40000 




10000 20000 30000 40000 

(d) 



Figure 41.5. A single neuron learning under the Langevin Monte Carlo method, (a) Evolution of 
weights wo, wi and w 2 as a function of number of iterations, (b) Evolution of weights Wi 
and u>2 in weight space. Also shown by a line is the evolution of the weights using the 
optimizer of figure 39.6. (c) The error function G(w) as a function of number of iterations. 
Also shown is the error function during the optimization of figure 39.6. (d) The objective 
function M(x) as a function of number of iterations. See also figures 41.6 and 41.7. 



and a step in w is made, given by 

Aw= -±6 2 g + ep. (41.20) 

Notice that if the ep term were omitted this would simply be gradient descent 
with learning rate n = ^e 2 . This step in w is accepted or rejected depending 
on the change in the value of the objective function M(w) and on the change 
in gradient, with a probability of acceptance such that detailed balance holds. 

The Langevin method has one free parameter, e, which controls the typical 
step size. If e is set to too large a value, moves may be rejected. If it is set to 
a very small value, progress around the state space will be slow. 



Demonstration of Langevin method 

The Langevin method is demonstrated in figures 41.5, 41.6 and 41.7. Here, the 
objective function is M(w) = G(w) + aEw{w), with a = 0.01. These figures 
include, for comparison, the results of the previous optimization method using 
gradient descent on the same objective function (figure 39.6). It can be seen 
that the mean evolution of w is similar to the evolution of the parameters 
under gradient descent. The Monte Carlo method appears to have converged 
to the posterior distribution after about 10 000 iterations. 

The average acceptance rate during this simulation was 93%; only 7% of 
the proposed moves were rejected. Probably, faster progress around the state 
space would have been made if a larger step size e had been used, but the 
value was chosen so that the 'descent rate' rj = ^e 2 matched the step size of 
the earlier simulations. 



Making Bayesian predictions 

From iteration 10,000 to 40,000, the weights were sampled every 1000 itera- 
tions and the corresponding functions of x are plotted in figure 41.6. There 
is a considerable variety of plausible functions. We obtain a Monte Carlo ap- 
proximation to the Bayesian predictions by averaging these thirty functions of 
x together. The result is shown in figure 41.7 and contrasted with the predic- 
tions given by the optimized parameters. The Bayesian predictions become 
satisfyingly moderate as we move away from the region of highest data density. 
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Figure 41.6. Samples obtained by 
the Langevin Monte Carlo 
method. The learning rate was set 
to T) = 0.01 and the weight decay 
rate to a = 0.01. The step size is 
given by e = y/2rj- The function 
performed by the neuron is shown 
by three of its contours every 1000 
iterations from iteration 10 000 to 
40 000. The contours shown are 
those corresponding to a = 0, ±1, 
namely y = 0.5, 0.27 and 0.73. 
Also shown is a vector 
proportional to (wi,w 2 ). 



5 10 5 10 



Figure 41.7. Bayesian predictions found by the Langevin Monte Carlo method compared with the 
predictions using the optimized parameters, (a) The predictive function obtained by av- 
eraging the predictions for 30 samples uniformly spaced between iterations 10 000 and 
40 000, shown in figure 41.6. The contours shown are those corresponding to a = 0, ±1, ±2, 
namely y = 0.5,0.27, 0.73, 0.12 and 0.88. (b) For contrast, the predictions given by the 
'most probable' setting of the neuron's parameters, as given by optimization of M(w). 
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wnew = w ; 










gnew = g ; 










for tau = l:Tau 










p = p - epsilon * gnew / 2 ; 


# 


make 


half-step in 


P 


wnew = wnew + epsilon * p ; 


# 


make 


step in w 




gnew = gradM ( wnew ) ; 


# 


find 


new gradient 




p = p - epsilon * gnew / 2 ; 


# 


make 


half-step in 


P 


endf or 











10 r- 
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2000 4000 6000 8000 10000 



The Bayesian classifier is better able to identify the points where the classi- 
fication is uncertain. This pleasing behaviour results simply from a mechanical 
application of the rules of probability. 

Optimization and typicality 

A final observation concerns the behaviour of the functions G(w) and M(w) 
during the Monte Carlo sampling process, compared with the values of G and 
M at the optimum w MP (figure 41.5). The function G(w) fluctuates around 
the value of G(w MP ), though not in a symmetrical way. The function M(w) 
also fluctuates, but it does not fluctuate around M(w MP ) - obviously it cannot, 
because M is minimized at w MP , so M could not go any smaller - furthermore, 
M only rarely drops close to M(w MP ). In the language of information theory, 
the typical set o/w has different properties from the most probable state w MP . 

A general message therefore emerges - applicable to all data models, not 
just neural networks: one should be cautious about making use of optimized 
parameters, as the properties of optimized parameters may be unrepresen- 
tative of the properties of typical, plausible parameters; and the predictions 
obtained using optimized parameters alone will often be unreasonably over- 
confident. 

Reducing random walk behaviour using Hamiltonian Monte Carlo 

As a final study of Monte Carlo methods, we now compare the Langevin Monte 
Carlo method with its big brother, the Hamiltonian Monte Carlo method. The 
change to Hamiltonian Monte Carlo is simple to implement, as shown in algo- 
rithm 41.8. Each single proposal makes use of multiple gradient evaluations 



41 — Learning as Inference 

Algorithm 41.8. Octave source 
code for the Hamiltonian Monte 
Carlo method. The algorithm is 
identical to the Langevin method 
in algorithm 41.4, except for the 
replacement of the four lines 
marked * in that algorithm by the 
fragment shown here. 



Figure 41.9. Comparison of 
sampling properties of the 
Langevin Monte Carlo method 
and the Hamiltonian Monte Carlo 
(HMC) method. The horizontal 
axis is the number of gradient 
evaluations made. Each figure 
shows the weights during the first 
10,000 iterations. The rejection 
rate during this Hamiltonian 
Monte Carlo simulation was 8%. 
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along a dynamical trajectory in w, p space, where p are the extra 'momentum' 
variables of the Langevin and Hamiltonian Monte Carlo methods. The num- 
ber of steps 'Tau' was set at random to a number between 100 and 200 for each 
trajectory. The step size e was kept fixed so as to retain comparability with 
the simulations that have gone before; it is recommended that one randomize 
the step size in practical applications, however. 

Figure 41.9 compares the sampling properties of the Langevin and Hamil- 
tonian Monte Carlo methods. The autocorrelation of the state of the Hamil- 
tonian Monte Carlo simulation falls much more rapidly with simulation time 
than that of the Langevin method. For this toy problem, Hamiltonian Monte 
Carlo is at least ten times more efficient in its use of computer time. 



► 41.5 Implementing inference with Gaussian approximations 

Physicists love to take nonlinearities and locally linearize them, and they love 
to approximate probability distributions by Gaussians. Such approximations 
offer an alternative strategy for dealing with the integral 

P(t (JV+1) = l|x (JV+1) ,i},a) = /d K wy(x (,v+1) ;w)-^exp(-M(w)), (41.21) 

J Zm 

which we just evaluated using Monte Carlo methods. 

We start by making a Gaussian approximation to the posterior probability. 
We go to the minimum of M(w) (using a gradient-based optimizer) and Taylor- 
expand M there: 

M(w) ~ M(wmp) + i(w - w MP ) T A(w - w MP ) + • • • , (41.22) 
where A is the matrix of second derivatives, also known as the Hessian, defined 

by 



dwidwj 

We thus define our Gaussian approximation: 

1 



Q(w;w MP ,A) = [det(A/27r)] 1/2 exp 



(w - w MP ) T A(w - w MP ) 



(41.23) 



. (41.24) 



We can think of the matrix A as defining error bars on w. To be precise, Q 
is a normal distribution whose variance-covariance matrix is A -1 . 

4t ' Vj Exercise 41.1. L J Show that the second derivative of M(w) with respect to w 
is given by 

pa n 

" M(w) = £ /^"V!"'^ + (41-25) 



dw;dw« 

J n=l 

where f'(a) is the first derivative of f(a) = 1/(1 + e~ a ), which is 

f'(a) = ^f(a) = f(a)(l-f(a)), (41.26) 

and 

a (») =J2w j x'f ) . (41.27) 



Having computed the Hessian, our task is then to perform the integral (41.21) 
using our Gaussian approximation. 
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Figure 41.10. The marginalized 
probability, and an approximation 
to it. (a) The function tf)(a, s 2 ), 
evaluated numerically. In (b) the 
functions if>(a, s 2 ) and (f>(a, s 2 ) 
defined in the text are shown as a 
function of a for s 2 = 4. From 
MacKay (1992b). 
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Calculating the marginalized probability 

The output j/(x; w) only depends on w through the scalar a(x; w), so we can 
reduce the dimensionality of the integral by finding the probability density of 
a. We are assuming a locally Gaussian posterior probability distribution over 
w = w MP + Aw, P(w\D,a) ~ (1/Zq) exp( — ^Aw T AAw). For our single 
neuron, the activation o(x;w) is a linear function of w with da/d~w = x, so 
for any x, the activation a is Gaussian-distributed. 

t> Exercise 41.2. ^ Assuming w is Gaussian-distributed with mean w MP and 
variance-covariance matrix A -1 , show that the probability distribution 
of o(x) is 

(a - a MP ) 2s 



P(a | x, D, a) = Normal(a MP , s ) 



V27TS 2 



exp 



2s 2 



(41.28) 



where a MP = a(x; w MP ) and ,s 2 = x T A 1 x. 
This means that the marginalized output is: 

P(t=l \x,D,a) = ip{a MP ,s 2 ) = J da f(a) Normal(a MP , s 2 ). (41.29) 

This is to be contrasted with y(x; w MP ) = /(a MP ), the output of the most prob- 
able network. The integral of a sigmoid times a Gaussian can be approximated 
by: 



ip{a MP ,s ) ~ <j){a MP ,s ) = /(K(s)a M p) 



(41.30) 



with k = 1/^1 + tts 2 /8 (figure 41.10). 



Figure 41.11. The Gaussian 
approximation in weight space 
and its approximate predictions in 
input space, (a) A projection of 
the Gaussian approximation onto 
the (wi,u>2) plane of weight 
space. The one- and 
two-standard-deviation contours 
are shown. Also shown are the 
trajectory of the optimizer, and 
the Monte Carlo method's 
samples, (b) The predictive 
function obtained from the 
Gaussian approximation and 
equation (41.30). (cf. figure 41.2.) 



Demonstration 



Figure 41.11 shows the result of fitting a Gaussian approximation at the op- 
timum w MP , and the results of using that Gaussian approximation and equa- 
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41.5: Implementing inference with Gaussian approximations 

tion (41.30) to make predictions. Comparing these predictions with those of 
the Langevin Monte Carlo method (figure 41.7) we observe that, whilst quali- 
tatively the same, the two are clearly numerically different. So at least one of 
the two methods is not completely accurate. 

t> Exercise 41.3. ^ Is the Gaussian approximation to P(w \ D, a) too heavy-tailed 
or too light-tailed, or both? It may help to consider P(w | D, a) as a 
function of one parameter Wi and to think of the two distributions on 
a logarithmic scale. Discuss the conditions under which the Gaussian 
approximation is most accurate. 

Why marginalize? 

If the output is immediately used to make a (0/1) decision and the costs asso- 
ciated with error are symmetrical, then the use of marginalized outputs under 
this Gaussian approximation will make no difference to the performance of the 
classifier, compared with using the outputs given by the most probable param- 
eters, since both functions pass through 0.5 at a MP = 0. But these Bayesian 
outputs will make a difference if, for example, there is an option of saying 'I 
don't know', in addition to saying 'I guess 0' and 'I guess 1'. And even if 
there are just the two choices '0' and '1', if the costs associated with error are 
unequal, then the decision boundary will be some contour other than the 0.5 
contour, and the boundary will be affected by marginalization. 
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Postscript on Supervised Neural 
Networks 



One of my students, Robert, asked: 

Maybe I'm missing something fundamental, but supervised neural 
networks seem equivalent to fitting a pre-defined function to some 
given data, then extrapolating - what's the difference? 

I agree with Robert. The supervised neural networks we have studied so far 
are simply parameterized nonlinear functions which can be fitted to data. 
Hopefully you will agree with another comment that Robert made: 

Unsupervised networks seem much more interesting than their su- 
pervised counterparts. I'm amazed that it works! 
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42 



HopGeld Networks 



We have now spent three chapters studying the single neuron. The time has 
come to connect multiple neurons together, making the output of one neuron 
be the input to another, so as to make neural networks. 

Neural networks can be divided into two classes on the basis of their con- 
nectivity. 




Figure 42.1. (a) A feedforward 
network, (b) A feedback network. 



Feedforward networks. In a feedforward network, all the connections are 
directed such that the network forms a directed acyclic graph. 

Feedback networks. Any network that is not a feedforward network will be 
called a feedback network. 

In this chapter we will discuss a fully connected feedback network called 
the Hopfield network. The weights in the Hopfield network are constrained to 
be symmetric, i.e., the weight from neuron i to neuron j is equal to the weight 
from neuron j to neuron i. 

Hopfield networks have two applications. First, they can act as associative 
memories. Second, they can be used to solve optimization problems. We will 
first discuss the idea of associative memory, also known as content-addressable 
memory. 



► 42.1 Hebbian learning 

In Chapter 38, we discussed the contrast between traditional digital memories 
and biological memories. Perhaps the most striking difference is the associative 
nature of biological memory. 

A simple model due to Donald Hebb (1949) captures the idea of associa- 
tive memory. Imagine that the weights between neurons whose activities are 
positively correlated are increased: 

~ Correlation^, xj). (42.1) 

Now imagine that when stimulus m is present (for example, the smell of a 
banana), the activity of neuron m increases; and that neuron n is associated 
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with another stimulus, n (for example, the sight of a yellow object). If these 
two stimuli - a yellow sight and a banana smell - co-occur in the environment, 
then the Hebbian learning rule (42.1) will increase the weights w nm and w mn . 
This means that when, on a later occasion, stimulus n occurs in isolation, mak- 
ing the activity x n large, the positive weight from n to m will cause neuron m 
also to be activated. Thus the response to the sight of a yellow object is an 
automatic association with the smell of a banana. We could call this 'pattern 
completion'. No teacher is required for this associative memory to work. No 
signal is needed to indicate that a correlation has been detected or that an as- 
sociation should be made. The unsupervised, local learning algorithm and the 
unsupervised, local activity rule spontaneously produce associative memory. 

This idea seems so simple and so effective that it must be relevant to how 
memories work in the brain. 

► 42.2 Definition of the binary Hopfield network 

Convention for weights. Our convention in general will be that Wij denotes 
the connection from neuron j to neuron i. 

Architecture. A Hopfield network consists of / neurons. They are fully 
connected through symmetric, bidirectional connections with weights 
Wij = Wji. There are no self-connections, so wu = for all i. Biases 
Wio may be included (these may be viewed as weights from a neuron '0' 
whose activity is permanently xq = 1). We will denote the activity of 
neuron i (its output) by x%. 

Activity rule. Roughly, a Hopfield network's activity rule is for each neu- 
ron to update its state as if it were a single neuron with the threshold 
activation function 




(42.2) 



Since there is feedback in a Hopfield network (every neuron's output is 
an input to all the other neurons) we will have to specify an order for the 
updates to occur. The updates may be synchronous or asynchronous. 



Synchronous updates - all neurons compute their activations 




(42.3) 



then update their states simultaneously to 



Xi = 6»(o»). 



(42.4) 



Asynchronous updates - one neuron at a time computes its activa- 
tion and updates its state. The sequence of selected neurons may 
be a fixed sequence or a random sequence. 



The properties of a Hopfield network may be sensitive to the above 
choices. 



Learning rule. The learning rule is intended to make a set of desired memo- 
ries {x("'} be stable states of the Hopfield network's activity rule. Each 
memory is a binary pattern, with € {—1,1}. 
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The weights are set using the sum of outer products or Hebh rule, 

Wij = vJ2 x i n)x t l) > ( 42 - 5 ) 



Figure 42.2. Associative memory 
(schematic), (a) A list of desired 
memories, (b) The first purpose of 
an associative memory is pattern 
completion, given a partial 
pattern, (c) The second purpose 
of a memory is error correction. 



where r\ is an unimportant constant. To prevent the largest possible 
weight from growing with N we might choose to set rj = 1/N . 

Exercise 42.1. ^ Explain why the value of n is not important for the Hopfield 
network defined above. 



► 42.3 Definition of the continuous Hopfield network 

Using the identical architecture and learning rule we can define a Hopfield 
network whose activities are real numbers between —1 and 1. 

Activity rule. A Hopfield network's activity rule is for each neuron to up- 
date its state as if it were a single neuron with a sigmoid activation 
function. The updates may be synchronous or asynchronous, and in- 
volve the equations 

ctj = w ij x j (42.6) 

j 

and 

Xi = tanh(aj). (42.7) 

The learning rule is the same as in the binary Hopfield network, but the 
value of r\ becomes relevant. Alternatively, we may fix r\ and introduce a gain 
(3 € (0, oo ) into the activation function: 

Xi = tanh(/3ctj). (42.8) 

• " Exercise 42.2. ^ Where have we encountered equations 42.6, 42.7, and 42.8 
«S>- before? 



► 42.4 Convergence of the Hopfield network 

The hope is that the Hopfield networks we have defined will perform associa- 
tive memory recall, as shown schematically in figure 42.2. We hope that the 
activity rule of a Hopfield network will take a partial memory or a corrupted 
memory, and perform pattern completion or error correction to restore the 
original memory. 

But why should we expect any pattern to be stable under the activity rule, 
let alone the desired memories? 

We address the continuous Hopfield network, since the binary network is 
a special case of it. We have already encountered the activity rule (42.6, 42.8) 
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when we discussed variational methods (section 33.2): when we approximated 
the spin system whose energy function was 



i?(x; J) — ^ ^ J mn x m x n ^ ] h n x n (42.9) 



2 



with a separable distribution 



Q(x; a) = -J- exp |^ a„x„j (42.10) 
and optimized the latter so as to minimize the variational free energy 

0F(a)=0j2 0(x;a)i?(x;J) - Q( X ;a)ln — (42.11) 
we found that the pair of iterative equations 

a m = /3 + h„^j (42.12) 

and 

x„ = tanh(o n ) (42.13) 
were guaranteed to decrease the variational free energy 

0(a) = P (~^2j nm x m x n -J2 h n*n) -T, H 2 ] (9n)- (42.14) 

\ m.n n / n 

If we simply replace J by w, x by x, and h n by u>io, we see that the 
equations of the Hopfield network are identical to a set of mean-field equations 
that minimize 

/?F(x) = -/^x T Wx -Y,Hi e \{l + Xi )/2]. (42.15) 

i 

There is a general name for a function that decreases under the dynamical 
evolution of a system and that is bounded below: such a function is a Lyapunov 
function for the system. It is useful to be able to prove the existence of 
Lyapunov functions: if a system has a Lyapunov function then its dynamics 
are bound to settle down to a fixed point, which is a local minimum of the 
Lyapunov function, or a limit cycle, along which the Lyapunov function is a 
constant. Chaotic behaviour is not possible for a system with a Lyapunov 
function. If a system has a Lyapunov function then its state space can be 
divided into basins of attraction, one basin associated with each attractor. 

So, the continuous Hopfield network's activity rules (if implemented asyn- 
chronously) have a Lyapunov function. This Lyapunov function is a convex 
function of each parameter o« so a Hopfield network's dynamics will always 
converge to a stable fixed point. 

This convergence proof depends crucially on the fact that the Hopfield 
network's connections are symmetric. It also depends on the updates being 
made asynchronously. 

Exercise 42.3. t 2 ' P- 520 l Show by constructing an example that if a feedback 
network does not have symmetric connections then its dynamics may 
fail to converge to a fixed point. 

Exercise 42.4.^' P- 521 l Show by constructing an example that if a Hopfield 
network is updated synchronously that, from some initial conditions, it 
may fail to converge to a fixed point. 
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Figure 42.3. Binary Hopfield 
network storing four memories, 
(a) The four memories, and the 
weight matrix, (b-h) Initial states 
that differ by one, two, three, four, 
or even five bits from a desired 
memory are restored to that 
memory in one or two iterations, 
(i-m) Some initial conditions that 
are far from the memories lead to 
stable states other than the four 
memories; in (i), the stable state 
looks like a mixture of two 
memories, 'D' and 'J'; stable state 
(j) is like a mixture of 'J' and 'C; 
in (k), we find a corrupted version 
of the 'M' memory (two bits 
distant); in (1) a corrupted version 
of 'J' (four bits distant) and in 
(m), a state which looks spurious 
until we recognize that it is the 
inverse of the stable state (1). 
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► 42.5 The associative memory in action 

Figure 42.3 shows the dynamics of a 25- unit binary Hopfield network that 
has learnt four patterns by Hebbian learning. The four patterns are displayed 
as five by five binary images in figure 42.3a. For twelve initial conditions, 
panels (b-m) show the state of the network, iteration by iteration, all 25 
units being updated asynchronously in each iteration. For an initial condition 
randomly perturbed from a memory, it often only takes one iteration for all 
the errors to be corrected. The network has more stable states in addition 
to the four desired memories: the inverse of any stable state is also a stable 
state; and there are several stable states that can be interpreted as mixtures 
of the memories. 

Brain damage 

The network can be severely damaged and still work fine as an associative 
memory. If we take the 300 weights of the network shown in figure 42.3 and 
randomly set 50 or 100 of them to zero, we still find that the desired memories 
are attracting stable states. Imagine a digital computer that still works fine 
even when 20% of its components are destroyed! 

> Exercise 42.5. ^ Implement a Hopfield network and confirm this amazing ro- 
bust error-correcting capability. 

More memories 

We can squash more memories into the network too. Figure 42.4a shows a set 
of five memories. When we train the network with Hebbian learning, all five 
memories are stable states, even when 26 of the weights are randomly deleted 
(as shown by the 'x's in the weight matrix). However, the basins of attraction 
are smaller than before: figures 42.4(b-f) show the dynamics resulting from 
randomly chosen starting states close to each of the memories (3 bits flipped). 
Only three of the memories are recovered correctly. 

If we try to store too many patterns, the associative memory fails catas- 
trophically. When we add a sixth pattern, as shown in figure 42.5, only one 
of the patterns is stable; the others all flow into one of two spurious stable 
states. 

► 42.6 The continuous-time continuous Hopfield network 

The fact that the Hopfield network's properties are not robust to the minor 
change from asynchronous to synchronous updates might be a cause for con- 
cern; can this model be a useful model of biological networks? It turns out 
that once we move to a continuous-time version of the Hopfield networks, this 
issue melts away. 

We assume that each neuron's activity X, is a continuous function of time 
Xi(t) and that the activations a,i(t) are computed instantaneously in accordance 
with 



j 

The neuron's response to its activation is assumed to be mediated by the 
differential equation: 




(42.16) 



-TlXi{t) = --(xi(t) - /(Oi)), 



(42.17) 
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Figure 42.4. Hopfield network 
storing five memories, and 
suffering deletion of 26 of its 300 
weights, (a) The five memories, 
and the weights of the network, 
with deleted weights shown by 'x'. 
(b-f) Initial states that differ by 
three random bits from a 
memory: some are restored, but 
some converge to other states. 
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Figure 42.5. An overloaded 
Hopfield network trained on six 
memories, most of which are not 
"stable. 
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where f(a) is the activation function, for example f(a) = tanh(a). For a 
steady activation Oj, the activity Xi(t) relaxes exponentially to /(a,) with 
time-constant r. 

Now, here is the nice result: as long as the weight matrix is symmetric, 
this system has the variational free energy (42.15) as its Lyapunov function. 

> Exercise 42.6. ^ By computing 4:F, prove that the variational free energy 
-F(x) is a Lyapunov function for the continuous-time Hopfield network. 

It is particularly easy to prove that a function L is a Lyapunov functions if 
the system's dynamics perform steepest descent on L, with ^Xi(t) oc ^-L. 
In the case of the continuous-time continuous Hopfield network, it is not quite 
so simple, but every component of ^xi{t) does have the same sign as g§--F\ 
which means that with an appropriately defined metric, the Hopfield network 
dynamics do perform steepest descents on F(x). 



► 42.7 The capacity of the Hopfield network 

One way in which we viewed learning in the single neuron was as communica- 
tion - communication of the labels of the training data set from one point in 
time to a later point in time. We found that the capacity of a linear threshold 
neuron was 2 bits per weight. 

Similarly, we might view the Hopfield associative memory as a commu- 
nication channel (figure 42.6). A list of desired memories is encoded into a 
set of weights W using the Hebb rule of equation (42.5), or perhaps some 
other learning rule. The receiver, receiving the weights W only, finds the 
stable states of the Hopfield network, which he interprets as the original mem- 
ories. This communication system can fail in various ways, as illustrated in 
the figure. 

1. Individual bits in some memories might be corrupted, that is, a sta- 
ble state of the Hopfield network is displaced a little from the desired 
memory. 

2. Entire memories might be absent from the list of attractors of the net- 
work; or a stable state might be present but have such a small basin of 
attraction that it is of no use for pattern completion and error correction. 

3. Spurious additional memories unrelated to the desired memories might 
be present. 

4. Spurious additional memories derived from the desired memories by op- 
erations such as mixing and inversion may also be present. 



42 — Hopfield Networks 

Figure 42.6. Failure modes of a 
Hopfield network (highly 
schematic). A list of desired 
memories, and the resulting list of 
attracting stable states. Notice 
(1) some memories that arc 
retained with a small number of 
errors; (2) desired memories that 
are completely lost (there is no 
attracting stable state at the 
desired memory or near it); (3) 
spurious stable states unrelated to 
the original list; (4) spurious 
stable states that are 
confabulations of desired 
memories. 
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Of these failure modes, modes 1 and 2 are clearly undesirable, mode 2 espe- 
cially so. Mode 3 might not matter so much as long as each of the desired 
memories has a large basin of attraction. The fourth failure mode might in 
some contexts actually be viewed as beneficial. For example, if a network is 
required to memorize examples of valid sentences such as 'John loves Mary' 
and 'John gets cake', we might be happy to find that 'John loves cake' was also 
a stable state of the network. We might call this behaviour 'generalization'. 

The capacity of a Hopfield network with I neurons might be defined to be 
the number of random patterns N that can be stored without failure-mode 2 
having substantial probability. If we also require failure-mode 1 to have tiny 
probability then the resulting capacity is much smaller. We now study these 
alternative definitions of the capacity. 



The capacity of the Hopfield network - stringent definition 

We will first explore the information storage capabilities of a binary Hopfield 
network that learns using the Hebb rule by considering the stability of just 
one bit of one of the desired patterns, assuming that the state of the network 
is set to that desired pattern x". We will assume that the patterns to be 
stored are randomly selected binary patterns. 
The activation of a particular neuron i is 



WijX) 



(42.18) 



where the weights are, for i ^ j, 



. \ ~> (m) (m) 



(42.19) 



Here we have split W into two terms, the first of which will contribute 'signal' , 
reinforcing the desired memory, and the second 'noise'. Substituting for Wij, 
the activation is 



i +EE 1 



(1-1); 



(42.20) 

(42.21) 



The first term is (I — 1) times the desired state x\ . If this were the only 
term, it would keep the neuron firmly clamped in the desired state. The 
second term is a sum of (7 — 1)(N — 1) random quantities x^Xj^Xj^. A 
moment's reflection confirms that these quantities are independent random 
binary variables with mean and variance 1. 

Thus, considering the statistics of ctj under the ensemble of random pat- 
terns, we conclude that a, has mean (7 — l)x^ and variance (7 — 1)(N — 1). 

For brevity, we will now assume 7 and N are large enough that we can 
neglect the distinction between 7 and 7 — 1, and between N and N — 1. Then 
we can restate our conclusion: Oj is Gaussian-distributed with mean 7a; j™' and 
variance 7iV. 

What then is the probability that the selected bit is stable, if we put the 
network into the state x^? The probability that bit i will flip on the first 
iteration of the Hopfield network's dynamics is 



'IN 



Figure 42.7. The probability 
density of the activation a t in the 
case = 1; the probability that 
bit i becomes nipped is the area 
of the tail. 



P(i unstable) = $ 



/IN 



$ - 



1 



jWi 



(42.22) 
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Figure 42.8. Overlap between a 
desired memory and the stable 
state nearest to it as a function of 
the loading fraction N/I. The 
overlap is defined to be the scaled 

inner product Yli x i x i"^ / 1 ■> which 
is 1 when recall is perfect and zero 
when the stable state has 50% of 
the bits flipped. There is an 
abrupt transition at N/I = 0.138, 
where the overlap drops from 0.97 
to zero. 



where 



$(z)= / & z -L= e - z2 l 2 . (42.23) 



The important quantity N/I is the ratio of the number of patterns stored to 
the number of neurons. If, for example, we try to store N ~ 0.18/ patterns 
in the Hopfield network then there is a chance of 1% that a specified bit in a 
specified pattern will be unstable on the first iteration. 

We are now in a position to derive our first capacity result, for the case 
where no corruption of the desired memories is permitted. 

t> Exercise 42.7. ^ Assume that we wish all the desired patterns to be completely 
stable - we don't want any of the bits to flip when the network is put 
into any desired pattern state - and the total probability of any error at 
all is required to be less than a small number e. Using the approximation 
to the error function for large z, 

1 e- z2 l 2 

$(-z) ~ -= , (42.24) 



/2tt 

show that the maximum number of patterns that can be stored, A max , 
is 

^ 41n/ + 21n(l/ £) - ^ 

If, however, we allow a small amount of corruption of memories to occur, the 
number of patterns that can be stored increases. 



The statistical physicists ' capacity 

The analysis that led to equation (42.22) tells us that if we try to store N ~ 
0.18/ patterns in the Hopfield network then, starting from a desired memory, 
about 1% of the bits will be unstable on the first iteration. Our analysis does 
not shed light on what is expected to happen on subsequent iterations. The 
flipping of these bits might make some of the other bits unstable too, causing 
an increasing number of bits to be flipped. This process might lead to an 
avalanche in which the network's state ends up a long way from the desired 
memory. 

In fact, when N/I is large, such avalanches do happen. When N/I is small, 
they tend not to - there is a stable state near to each desired memory. For the 
limit of large /, Amit et al. (1985) have used methods from statistical physics 
to find numerically the transition between these two behaviours. There is a 
sharp discontinuity at 

A crit = 0.138/. (42.26) 
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Below this critical value, there is likely to be a stable state near every desired 
memory, in which a small fraction of the bits are flipped. When N/I exceeds 
0.138, the system has only spurious stable states, known as spin glass states, 
none of which is correlated with any of the desired memories. Just below the 
critical value, the fraction of bits that are flipped when a desired memory has 
evolved to its associated stable state is 1.6%. Figure 42.8 shows the overlap 
between the desired memory and the nearest stable state as a function of N/I. 

Some other transitions in properties of the model occur at some additional 
values of N/I, as summarized below. 

For all N/I, stable spin glass states exist, uncorrelated with the desired 
memories. 

For N/I > 0.138, these spin glass states are the only stable states. 

For N/I £ (0, 0.138), there are stable states close to the desired memories. 

For N/I £ (0,0.05), the stable states associated with the desired memories 
have lower energy than the spurious spin glass states. 

For N/I £ (0.05,0.138), the spin glass states dominate - there are spin glass 
states that have lower energy than the stable states associated with the 
desired memories. 

For N/I £ (0,0.03), there are additional mixture states, which are combina- 
tions of several desired memories. These stable states do not have as low 
energy as the stable states associated with the desired memories. 

In conclusion, the capacity of the Hopfield network with / neurons, if we 
define the capacity in terms of the abrupt discontinuity discussed above, is 
0.138/ random binary patterns, each of length /, each of which is received 

with 1.6% of its bits flipped. In bits, this capacity is This expression for the capacity 

omits a smaller negative term of 
0.1387 2 x (1 - ff 2 (0.016)) = 0.122 I 2 bits. (42.27) order iVlog 2 N bits, associated 

with the arbitrary order of the 

Since there are I 2 /2 weights in the network, we can also express the capacity memories. 

as 0.24 bits per weight. 



► 42.8 Improving on the capacity of the Hebb rule 

The capacities discussed in the previous section are the capacities of the Hop- 
field network whose weights are set using the Hebbian learning rule. We can 
do better than the Hebb rule by defining an objective function that measures 
how well the network stores all the memories, and minimizing it. 

For an associative memory to be useful, it must be able to correct at 
least one flipped bit. Let's make an objective function that measures whether 
flipped bits tend to be restored correctly. Our intention is that, for every 
neuron i in the network, the weights to that neuron should satisfy this rule: 

for every pattern x'"', if the neurons other than i are set correctly 

(n) 

to Xj = x - , then the activation of neuron i should be such that 
its preferred output 

Is this rule a familiar idea? Yes, it is precisely what we wanted the single 
neuron of Chapter 39 to do. Each pattern defines an input, target pair 
for the single neuron i. And it defines an input, target pair for all the other 
neurons too. 
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w = x' * x ; # initialize the weights using Hebb rule 

for 1 = 1:L # loop L times 



for i=l:I 

w(i,i) = ; 
end 



endf or 



# ensure the self -weights are zero. 
# 



a 


= x * w ; 


# compute all activations 


y 


= sigmoid(a) ; 


# compute all outputs 


e 


= t - y ; 


# compute all errors 


gw 


= x' * e ; 


# compute the gradients 


gw 


= gw + gw' ; 


# symmetrize gradients 


w 


= w + eta * ( gw 


- alpha * w ) ; # make step 



Algorithm 42.9. Octave source 
code for optimizing the weights of 
a Hopfield network, so that it 
works as an associative memory, 
cf. algorithm 39.5. The data 
matrix x has I columns and N 
rows. The matrix t is identical to 
x except that —Is arc replaced by 
Os. 



So, just as we denned an objective function (39.11) for the training of a 
single neuron as a classifier, we can define 

G(W) = -EE t'ln^ + (1 " tVna - y\ n) ) (42.28) 

i n 

where 

I x\> = -1 

and 

= --, where a\ n) = £ w ijX f . (42.30) 

1 + exp(-o> ') 

We can then steal the algorithm (algorithm 39.5, p. 478) which we wrote for 
the single neuron, to write an algorithm for optimizing a Hopfield network, 
algorithm 42.9. The convenient syntax of Octave requires very few changes; 
the extra lines enforce the constraints that the self-weights wa should all be 
zero and that the weight matrix should be symmetrical (w^ = Wji). 

As expected, this learning algorithm does a better job than the one-shot 
Hebbian learning rule. When the six patterns of figure 42.5, which cannot be 
memorized by the Hebb rule, are learned using algorithm 42.9, all six patterns 
become stable states. 

Exercise 42.8. Implement this learning rule and investigate empirically its 
capacity for memorizing random patterns; also compare its avalanche 
properties with those of the Hebb rule. 



► 42.9 Hopfield networks for optimization problems 

Since a Hopfield network's dynamics minimize an energy function, it is natural 
to ask whether we can map interesting optimization problems onto Hopfield 
networks. Biological data processing problems often involve an element of 
constraint satisfaction - in scene interpretation, for example, one might wish 
to infer the spatial location, orientation, brightness and texture of each visible 
element, and which visible elements are connected together in objects. These 
inferences are constrained by the given data and by prior knowledge about 
continuity of objects. 
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Figure 42.10. Hopfield network for 
solving a travelling salesman 
problem with K = 4 cities. (al,2) 
Two solution states of the 
16-neuron network, with activites 
represented by black = 1, white = 
0; and the tours corresponding to 
these network states, (b) The 
negative weights between node B2 
and other nodes; these weights 
enforce validity of a tour, (c) The 
negative weights that embody the 
distance objective function. 



Hopfield and Tank (1985) suggested that one might take an interesting 
constraint satisfaction problem and design the weights of a binary or contin- 
uous Hopfield network such that the settling process of the network would 
minimize the objective function of the problem. 

The travelling salesman problem 

A classic constraint satisfaction problem to which Hopfield networks have been 
applied is the travelling salesman problem. 

A set of K cities is given, and a matrix of the K{K— 1)/2 distances between 
those cities. The task is to find a closed tour of the cities, visiting each city 
once, that has the smallest total distance. The travelling salesman problem is 
equivalent in difficulty to an NP-complete problem. 

The method suggested by Hopfield and Tank is to represent a tentative so- 
lution to the problem by the state of a network with / = K 2 neurons arranged 
in a square, with each neuron representing the hypothesis that a particular 
city comes at a particular point in the tour. It will be convenient to consider 
the states of the neurons as being between and 1 rather than —1 and 1. 
Two solution states for a four-city travelling salesman problem are shown in 
figure 42.10a. 

The weights in the Hopfield network play two roles. First, they must define 
an energy function which is minimized only when the state of the network 
represents a valid tour. A valid state is one that looks like a permutation 
matrix, having exactly one '1' in every row and one '1' in every column. This 
rule can be enforced by putting large negative weights between any pair of 
neurons that are in the same row or the same column, and setting a positive 
bias for all neurons to ensure that K neurons do turn on. Figure 42.10b shows 
the negative weights that are connected to one neuron, l B2\ which represents 
the statement 'city B comes second in the tour'. 

Second, the weights must encode the objective function that we want 
to minimize - the total distance. This can be done by putting negative 
weights proportional to the appropriate distances between the nodes in adja- 
cent columns. For example, between the B and D nodes in adjacent columns, 
the weight would be —dsD- The negative weights that are connected to neu- 
ron B2 are shown in figure 42.10c. The result is that when the network is in 
a valid state, its total energy will be the total distance of the corresponding 
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Figure 42.11. (a) Evolution of the 
state of a continuous Hopfield 
network solving a travelling 
salesman problem using Aiyer's 
(1991) graduated non-convexity 
method; the state of the network 
is projected into the 
two-dimensional space in which 
the cities arc located by finding 
the centre of mass for each point 
in the tour, using the neuron 
activities as the mass function, 
(b) The travelling scholar 
problem. The shortest tour 
linking the 27 Cambridge 
Colleges, the Engineering 
Department, the University 
Library, and Sree Aiyer's house. 
From Aiycr (1991). 



(b) 



tour, plus a constant given by the energy associated with the biases. 

Now, since a Hopfield network minimizes its energy, it is hoped that the 
binary or continuous Hopfield network's dynamics will take the state to a 
minimum that is a valid tour and which might be an optimal tour. This hope 
is not fulfilled for large travelling salesman problems, however, without some 
careful modifications. We have not specified the size of the weights that enforce 
the tour's validity, relative to the size of the distance weights, and setting this 
scale factor poses difficulties. If 'large' validity-enforcing weights are used, 
the network's dynamics will rattle into a valid state with little regard for the 
distances. If 'small' validity-enforcing weights are used, it is possible that the 
distance weights will cause the network to adopt an invalid state that has lower 
energy than any valid state. Our original formulation of the energy function 
puts the objective function and the solution's validity in potential conflict 
with each other. This difficulty has been resolved by the work of Sree Aiyer 
(1991), who showed how to modify the distance weights so that they would not 
interfere with the solution's validity, and how to define a continuous Hopfield 
network whose dynamics are at all times confined to a 'valid subspace'. Aiyer 
used a graduated non-convexity or deterministic annealing approach to find 
good solutions using these Hopfield networks. The deterministic annealing 
approach involves gradually increasing the gain (3 of the neurons in the network 
from to oo, at which point the state of the network corresponds to a valid 
tour. A sequence of trajectories generated by applying this method to a thirty- 
city travelling salesman problem is shown in figure 42.11a. 

A solution to the 'travelling scholar problem' found by Aiyer using a con- 
tinuous Hopfield network is shown in figure 42.11b. 
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42.10: Further exercises 519 

► 42.10 Further exercises 

t> Exercise 42.9.^ Storing two memories. 

Two binary memories m and n (mi, in G {— 1,+1}) are stored by Heb- 
bian learning in a Hopfield network using 

_ f m imj + mnj for i ^ j 
W v-\ for i = j. (iZ - 6L) 

The biases bi are set to zero. 

The network is put in the state x = m. Evaluate the activation a, of 
neuron i and show that in can be written in the form 

di = firrii + vrii. (42.32) 

By comparing the signal strength, ji, with the magnitude of the noise 
strength, \v\, show that x = m is a stable state of the dynamics of the 
network. 

The network is put in a state x differing in D places from m, 

x=m + 2d, (42.33) 

where the perturbation d satisfies di G {— 1,0, +1}. D is the number 
of components of d that are non-zero, and for each di that is non-zero, 
di = —mi. Defining the overlap between m and n to be 

/ 

o mn = ^2 mni, (42.34) 
i=i 

evaluate the activation a» of neuron i again and show that the dynamics 
of the network will restore x to m if the number of flipped bits satisfies 

D < i(J- |o mn | - 2). (42.35) 



How does this number compare with the maximum number of flipped 
bits that can be corrected by the optimal decoder, assuming the vector 
x is either a noisy version of m or of n? 

Exercise 42.10. ^ Hopfield network as a collection of binary classifiers. This ex- 
ercise explores the link between unsupervised networks and supervised 
networks. If a Hopfield network's desired memories are all attracting 
stable states, then every neuron in the network has weights going to it 
that solve a classification problem personal to that neuron. Take the set 
of memories and write them in the form -x'^ n \xf'\ where x' denotes all 
the components for all i' ^ i, and let w' denote the vector of weights 
wu>, for i' y^i. 

Using what we know about the capacity of the single neuron, show that 
it is almost certainly impossible to store more than 21 random memories 
in a Hopfield network of / neurons. 
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Lyapunov functions 

Exercise 42.11. ^ Erik's puzzle. In a stripped-down version of Conway's game 
of life, cells are arranged on a square grid. Each cell is either alive or 
dead. Live cells do not die. Dead cells become alive if two or more of 
their immediate neighbours are alive. (Neighbours to north, south, east 
and west.) What is the smallest number of live cells needed in order 
that these rules lead to an entire N x N square being alive? 

In a d-dimensional version of the same game, the rule is that if d neigh- 
bours are alive then you come to life. What is the smallest number of 
live cells needed in order that an entire N x N x ■ ■ ■ x N hypercube 
becomes alive? (And how should those live cells be arranged?) 

The southeast puzzle 



Figure 42.12. Erik's dynamics. 
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The southeast puzzle is played on a semi-infinite chess board, starting at 
its northwest (top left) corner. There are three rules: 

1 . In the starting position, one piece is placed in the northwest-most square 
(figure 42.13a). 

2. It is not permitted for more than one piece to be on any given square. 

3. At each step, you remove one piece from the board, and replace it with 
two pieces, one in the square immediately to the east, and one in the the 
square immediately to the south, as illustrated in figure 42.13b. Every 
such step increases the number of pieces on the board by one. 

After move (b) has been made, either piece may be selected for the next move. 
Figure 42.13c shows the outcome of moving the lower piece. At the next move, 
either the lowest piece or the middle piece of the three may be selected; the 
uppermost piece may not be selected, since that would violate rule 2. At move 
(d) we have selected the middle piece. Now any of the pieces may be moved, 
except for the leftmost piece. 
Now, here is the puzzle: 

> Exercise 42.12.^' P- 521 l I s it possible to obtain a position in which all the ten 
squares closest to the northwest corner, marked in figure 42.13z, are 
empty? 

[Hint: this puzzle has a connection to data compression.] 



Figure 42.13. The southeast 
puzzle. 



► 42.11 Solutions 

Solution to exercise 42.3 (p. 508). Take a binary feedback network with 2 neu- 
rons and let wu = 1 and W21 = — 1. Then whenever neuron 1 is updated, 
it will match neuron 2, and whenever neuron 2 is updated, it will flip to the 
opposite state from neuron 1. There is no stable state. 
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42.11: Solutions 

Solution to exercise 42.4 (p. 508). Take a binary Hopfield network with 2 neu- 
rons and let wu = u?2i = 1, and let the initial condition be x\ = 1, X2 = — 1. 
Then if the dynamics are synchronous, on every iteration both neurons will 
flip their state. The dynamics do not converge to a fixed point. 

Solution to exercise 42.12 (p. 520). The key to this problem is to notice its 
similarity to the construction of a binary symbol code. Starting from the 
empty string, we can build a binary tree by repeatedly splitting a codeword 
into two. Every codeword has an implicit probability 2~', where I is the 
depth of the codeword in the binary tree. Whenever we split a codeword in 
two and create two new codewords whose length is increased by one, the two 
new codewords each have implicit probability equal to half that of the old 
codeword. For a complete binary code, the Kraft equality affirms that the 
sum of these implicit probabilities is 1. 

Similarly, in southeast, we can associate a 'weight' with each piece on the 
board. If we assign a weight of 1 to any piece sitting on the top left square; 
a weight of 1/2 to any piece on a square whose distance from the top left is 
one; a weight of 1/4 to any piece whose distance from the top left is two; and 

so forth, with 'distance' being the city-block distance; then every legal move I I I I . . . 

in southeast leaves unchanged the total weight of all pieces on the board. ~ ~ 

Lyapunov functions come in two flavours: the function may be a function of ~ — ~ — ~ 

state whose value is known to stay constant; or it may be a function of state — — 

that is bounded below, and whose value always decreases or stays constant. 

The total weight is a Lyapunov function of the second type. 

The starting weight is 1, so now we have a powerful tool: a conserved : 
function of the state. Is it possible to find a position in which the ten highest- 
weight squares are vacant, and the total weight is 1? What is the total weight Figure 42.14. A possible posit' 
if all the other squares on the board are occupied (figure 42.14)? The total for the southeast puzzle? 
weight would be $^ 4 (i + 1)2"', which is equal to 3/4. So it is impossible to 
empty all ten of those squares. 
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43 



Boltzmann Machines 



► 43.1 From Hopfield networks to Boltzmann machines 

We have noticed that the binary Hopfield network minimizes an energy func- 
tion 

E(x) = -ix T Wx (43.1) 

and that the continuous Hopfield network with activation function x n = 
tanh(a„) can be viewed as approximating the probability distribution asso- 
ciated with that energy function, 



P(x | W) 



1 



Z(W) 



exp[-P(x)] 



1 



Z(W) 



exp 



ix T Wx 



(43.2) 



These observations motivate the idea of working with a neural network model 
that actually implements the above probability distribution. 

The stochastic Hopfield network or Boltzmann machine (Hinton and Se- 
jnowski, 1986) has the following activity rule: 



Activity rule of Boltzmann machine: after computing the activa- 
tion ai (42.3), 



set Xi = +1 with probability 



-2<n 



else set 



-1. 



(43.3) 



This rule implements Gibbs sampling for the probability distribution (43.2). 
Boltzmann machine learning 

Given a set of examples {x^™'}^ from the real world, we might be interested 
in adjusting the weights W such that the generative model 



P(x| W) 



Z(W) 



exp 



ix T Wx 

2 



(43.4) 



is well matched to those examples. We can derive a learning algorithm by 
writing down Bayes' theorem to obtain the posterior probability of the weights 
given the data: 



P(W|{xW}f}) 



N 



'l[ P(xW | W) 



P(W) 



P({xH}f}) 



(43.5) 
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43.1: From HopGeld networks to Boltzmann machines 



We concentrate on the first term in the numerator, the likelihood, and derive a 
maximum likelihood algorithm (though there might be advantages in pursuing 
a full Bayesian approach as we did in the case of the single neuron). We 
differentiate the logarithm of the likelihood, 



In 



JV 



Yl P(xW |W) 



N n 

-x (n) WxW -lnZ(W) 



(43.6) 



with respect to tiiy, bearing in mind that W is defined to be symmetric with 



Exercise 43.1. ^ Show that the derivative of lnZ(W) with respect to toy is 

In Z{ W) = XiXjP(x I W) = (x iXj ) P(x , w) . (43.7) 
ow t j x 

[This exercise is similar to exercise 22.12 (p. 307).] 
The derivative of the log likelihood is therefore: 

TV 

E[ (n) (n) _ / i 
\ x i x j \ X i X j/P(x\W) 



d 



lnP({x (n) }f}|W) 



n=l 

N (a;i^j)rj a ta ~~ i x i x j) 



3/P(x\ W) I 



(43.8) 
(43.9) 



This gradient is proportional to the difference of two terms. The first term is 
the empirical correlation between x^ and Xj, 



i x i x j) 



3 1 Data 



1 N 



n=l 



x^xf 



(43.10) 



and the second term is the correlation between Xi and Xj under the current 



model, 



(xiXj)p( x \W) = ^2 x i x 3 P ( x \ W )- 



(43.11) 



The first correlation { x i x j)-£, ai& is readily evaluated - it is just the empirical 
correlation between the activities in the real world. The second correlation, 
(xiXj) p , x | W ), is not so easy to evaluate, but it can be estimated by Monte 
Carlo methods, that is, by observing the average value of XiXj while the ac- 
tivity rule of the Boltzmann machine, equation (43.3), is iterated. 

In the special case W = 0, we can evaluate the gradient exactly because, 
by symmetry, the correlation { x i x j) p(^\ yv) mus t be zero. If the weights are 
adjusted by gradient descent with learning rate rj, then, after one iteration, 
the weights will be 



Wii 



N 



(43.12) 



71=1 



precisely the value of the weights given by the Hebb rule, equation (16.5), with 
which we trained the Hopfield network. 



Interpretation of Boltzmann machine learning 

One way of viewing the two terms in the gradient (43.9) is as 'waking' and 
'sleeping' rules. While the network is 'awake', it measures the correlation 
between Xi and Xj in the real world, and weights are increased in proportion. 
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While the network is 'asleep', it 'dreams' about the world using the generative 
model (43.4), and measures the correlations between Xi and Xj in the model 
world; these correlations determine a proportional decrease in the weights. If 
the second-order correlations in the dream world match the correlations in the 
real world, then the two terms balance and the weights do not change. 

Criticism of Hopfield networks and simple Boltzmann machines 

Up to this point we have discussed Hopfield networks and Boltzmann machines 
in which all of the neurons correspond to visible variables X{. The result 
is a probabilistic model that, when optimized, can capture the second-order 
statistics of the environment. [The second-order statistics of an ensemble 
P(x) are the expected values (x(Xj) of all the pairwise products X(Xj.} The 
real world, however, often has higher-order correlations that must be included 
if our description of it is to be effective. Often the second-order correlations 
in themselves may carry little or no useful information. 

Consider, for example, the ensemble of binary images of chairs. We can 
imagine images of chairs with various designs - four-legged chairs, comfy 
chairs, chairs with five legs and wheels, wooden chairs, cushioned chairs, chairs 
with rockers instead of legs. A child can easily learn to distinguish these images 
from images of carrots and parrots. But I expect the second-order statistics of 
the raw data are useless for describing the ensemble. Second-order statistics 
only capture whether two pixels are likely to be in the same state as each 
other. Higher-order concepts are needed to make a good generative model of 
images of chairs. 

A simpler ensemble of images in which high-order statistics are important 
is the 'shifter ensemble', which comes in two flavours. Figure 43.1a shows a 
few samples from the 'plain shifter ensemble'. In each image, the bottom eight 
pixels are a copy of the top eight pixels, either shifted one pixel to the left, 
or unshifted, or shifted one pixel to the right. (The top eight pixels are set 
at random.) This ensemble is a simple model of the visual signals from the 
two eyes arriving at early levels of the brain. The signals from the two eyes 
are similar to each other but may differ by small translations because of the 
varying depth of the visual world. This ensemble is simple to describe, but its 
second-order statistics convey no useful information. The correlation between 
one pixel and any of the three pixels above it is 1/3. The correlation between 
any other two pixels is zero. 

Figure 43.1b shows a few samples from the 'labelled shifter ensemble'. 
Here, the problem has been made easier by including an extra three neu- 
rons that label the visual image as being an instance of either the 'shift left', 
'no shift', or 'shift right' sub-ensemble. But with this extra information, the 
ensemble is still not learnable using second-order statistics alone. The second- 
order correlation between any label neuron and any image neuron is zero. We 
need models that can capture higher-order statistics of an environment. 

So, how can we develop such models? One idea might be to create models 
that directly capture higher-order correlations, such as: 

P'(x| W, V, . . .) = — exp I ^^WijXiXj + - ^VijkXiXjXk H 

\ ij ij 

(43.13) 

Such higher-order Boltzmann machines are equally easy to simulate using 
stochastic updates, and the learning rule for the higher-order parameters 
is equivalent to the learning rule for 
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Figure 43.1. The 'shifter' 
ensembles, (a) Four samples from 
the plain shifter ensemble, (b) 
Four corresponding samples from 
the labelled shifter ensemble. 
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43.2: Boltzmann machine with hidden units 

> Exercise 43.2. ^ Derive the gradient of the log likelihood with respect to Vijk- 

It is possible that the spines found on biological neurons are responsible for 
detecting correlations between small numbers of incoming signals. However, 
to capture statistics of high enough order to describe the ensemble of images 
of chairs well would require an unimaginable number of terms. To capture 
merely the fourth-order statistics in a 128 x 128 pixel image, we need more 
than 10 7 parameters. 

So measuring moments of images is not a good way to describe their un- 
derlying structure. Perhaps what we need instead or in addition are hidden 
variables, also known to statisticians as latent variables. This is the important 
innovation introduced by Hinton and Sejnowski (1986). The idea is that the 
high-order correlations among the visible variables are described by includ- 
ing extra hidden variables and sticking to a model that has only second-order 
interactions between its variables; the hidden variables induce higher-order 
correlations between the visible variables. 

► 43.2 Boltzmann machine with hidden units 

We now add hidden neurons to our stochastic model. These are neurons that 
do not correspond to observed variables; they are free to play any role in the 
probabilistic model defined by equation (43.4). They might actually take on 
interpretable roles, effectively performing 'feature extraction'. 

Learning in Boltzmann machines with hidden units 

The activity rule of a Boltzmann machine with hidden units is identical to that 
of the original Boltzmann machine. The learning rule can again be derived 
by maximum likelihood, but now we need to take into account the fact that 
the states of the hidden units are unknown. We will denote the states of the 
visible units by x, the states of the hidden units by h, and the generic state 
of a neuron (either visible or hidden) by yi, with y = (x,h). The state of the 
network when the visible neurons are clamped in state x 

(n) igy(n) = ( x («) ; h). 

The likelihood of W given a single data example x'™* 1 is 



P(x<"> |W) = ^P(xW,h 



h 



w ) = E^W) ex p[^y (n) ] T wy (; 



(43.14) 



where 




x.h 



(43.15) 



Equation (43.14) may also be written 



P(x<") | W) 



Z(W) 



(43.16) 



where 




h 



(43.17) 



Differentiating the likelihood as before, we find that the derivative with re- 
spect to any weight Wy is again the difference between a 'waking' term and a 
'sleeping' term, 



-lnP({ X W}f | W) = X){<»*W)p(h|x(»),W) " (Wfj)p(x,h|W)} ' 



n 



(43.18) 
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The first term (j/tJ/j')p(h|xW W) ^ s * ne correlation between yi and yj if the 
Boltzmann machine is simulated with the visible variables clamped to x'"' 
and the hidden variables freely sampling from their conditional distribution. 

The second term (UiUj) pi x h\W) ^ s * ne correlation between yi and yj when 
the Boltzmann machine generates samples from its model distribution. 

Hinton and Sejnowski demonstrated that non-trivial ensembles such as 
the labelled shifter ensemble can be learned using a Boltzmann machine with 
hidden units. The hidden units take on the role of feature detectors that spot 
patterns likely to be associated with one of the three shifts. 

The Boltzmann machine is time-consuming to simulate because the compu- 
tation of the gradient of the log likelihood depends on taking the difference of 
two gradients, both found by Monte Carlo methods. So Boltzmann machines 
are not in widespread use. It is an area of active research to create models 
that embody the same capabilities using more efficient computations (Hinton 
et al, 1995; Dayan et al, 1995; Hinton and Ghahramani, 1997; Hinton, 2001; 
Hinton and Teh, 2001). 

► 43.3 Exercise 

t> Exercise 43.3. ^ Can the 'bars and stripes' ensemble (figure 43.2) be learned 
by a Boltzmann machine with no hidden units? [You may be surprised!] 



□□□□□ 
□□□□□ 
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□□□□□ 
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□□□□□ 



Figure 43.2. Four samples from 
the 'bars and stripes' ensemble. 
Each sample is generated by first 
picking an orientation, horizontal 
or vertical; then, for each row of 
spins in that orientation (each bar 
or stripe respectively), switching 
all spins on with probability V2. 
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Supervised Learning in Multilayer 
Networks 



► 44.1 Multilayer perceptrons 

No course on neural networks could be complete without a discussion of su- 
pervised multilayer networks, also known as backpropagation networks. 

The multilayer perceptron is a feedforward network. It has input neurons, 
hidden neurons and output neurons. The hidden neurons may be arranged 
in a sequence of layers. The most common multilayer perceptrons have a 
single hidden layer, and are known as 'two-layer' networks, the number 'two' 
counting the number of layers of neurons not including the inputs. 

Such a feedforward network defines a nonlinear parameterized mapping 
from an input x to an output y = y(x;w,„4). The output is a continuous 
function of the input and of the parameters w; the architecture of the net, i.e., 
the functional form of the mapping, is denoted by A. Feedforward networks 
can be 'trained' to perform regression and classification tasks. 

Regression networks 

In the case of a regression problem, the mapping for a network with one hidden 
layer may have the form: 

Hidden layer: af = ^ wfx x + Bf; hj = f^af*) (44.1) 



)QO 




Outputs 



OOOOOOO Hiddens 
OOOOOO Inputs 



Figure 44.1. A typical two-layer 
network, with six inputs, seven 
hidden units, and three outputs. 
Each line represents one weight. 



Output layer: a 



(2) 



ol 2) ; W = /< a >(a?>) 



(44.2) 



where, for example, f^(a) = tanh(o), and f( 2 '(a) = a. Here I runs over 
the inputs x\, . . . ,xl, j runs over the hidden units, and i runs over the out- 
puts. The 'weights' w and 'biases' 9 together make up the parameter vector 
w. The nonlinear sigmoid function at the hidden layer gives the neu- 
ral network greater computational flexibility than a standard linear regression 
model. Graphically, we can represent the neural network as a set of layers of 
connected neurons (figure 44.1). 

What sorts of functions can these networks implement? 

Just as we explored the weight space of the single neuron in Chapter 39, 
examining the functions it could produce, let us explore the weight space of 
a multilayer network. In figures 44.2 and 44.3 I take a network with one 
input and one output and a large number H of hidden units, set the biases 




Figure 44.2. Samples from the 
prior over functions of a one-input 
network. For each of a sequence of 
values of cr bias = 8, 6, 4, 3, 2, 1.6, 
1.2, 0.8, 0.4, 0.3, 0.2, and 

is shown. The other 
hyperparameters of the network 



5o"bia S ) onc random function 



were H = 400, cr* 



0.05. 
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y 5 

Output 
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Hidden layer 



Input 




and weights 6f\ idf* 



"j . -"ji . e f ] and w if 
function y(x). I set the hidden unit biases „ , 
Gaussian with zero mean and standard deviation <7b 



weights Wji to random values with standard deviation <j- m ; and the bias and 

(2) (2) 

output weights 0\ and w\a to random values with standard deviation cr out . 

The sort of functions that we obtain depend on the values of (Tbi as , °m 
and (Tout- As the weights and biases are made bigger we obtain more complex 
functions with more features and a greater sensitivity to the input variable. 
The vertical scale of a typical function produced by the network with random 
weights is of order \fH<j ont ] the horizontal range in which the function varies 
significantly is of order cruas/^in] an d the shortest horizontal length scale is of 
order l/a- m . 

Radford Neal (1996) has also shown that in the limit as H — > oo the 
statistical properties of the functions generated by randomizing the weights are 
independent of the number of hidden units; so, interestingly, the complexity of 
the functions becomes independent of the number of parameters in the model. 
What determines the complexity of the typical functions is the characteristic 
magnitude of the weights. Thus we anticipate that when we fit these models to 
real data, an important way of controlling the complexity of the fitted function 
will be to control the characteristic magnitude of the weights. 

Figure 44.4 shows one typical function produced by a network with two 
inputs and one output. This should be contrasted with the function produced 
by a traditional linear regression model, which is a flat plane. Neural networks 
can create functions with more complexity than a linear regression. 



to random values, and plot the resulting 

^ to random values from a 
the input to hidden 



Figure 44.3. Properties of a 
function produced by a random 
network. The vertical scale of a 
typical function produced by the 
network with random weights is of 
order \fHa ou t\ the horizontal 
range in which the function varies 
significantly is of order a"bias/o"in; 
and the shortest horizontal length 
scale is of order 1 / 0"i n . The 
function shown was produced by 
making a random network with 
H = 400 hidden units, and 
Gaussian weights with dbias — 4, 
(T in = 8, and a out = 0.5. 




Figure 44.4. One sample from the 
prior of a two-input network with 
{H, <r&, <r£ ias , <r™ ut } = 
{400,8.0,8.0,0.05}. 



► 44.2 How a regression network is traditionally trained 

This network is trained using a data set D = {x( n ),t(™)} by adjusting w so as 
to minimize an error function, e.g., 

W = jEE - W(x (n) ; w)) 2 • (44.3) 

n i 

This objective function is a sum of terms, one for each input/target pair {x, t}, 
measuring how close the output y(x;w) is to the target t. 

This minimization is based on repeated evaluation of the gradient of Ed. 
This gradient can be efficiently computed using the baekpropagation algorithm 
(Rumelhart et al, 1986), which uses the chain rule to find the derivatives. 
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44.3: Neural network learning as inference 

Often, regularization (also known as weight decay) is included, modifying 
the objective function to: 

M(w) = f3E D + aE w (44.4) 

where, for example, Eyy = \ w f ■ This additional term favours small values 
of w and decreases the tendency of a model to overfit noise in the training 
data. 

Rumelhart et al. (1986) showed that multilayer perceptrons can be trained, 
by gradient descent on M(w), to discover solutions to non-trivial problems 
such as deciding whether an image is symmetric or not. These networks have 
been successfully applied to real-world tasks as varied as pronouncing English 
text (Sejnowski and Rosenberg, 1987) and focussing multiple-mirror telescopes 
(Angel et al, 1990). 



► 44.3 Neural network learning as inference 

The neural network learning process above can be given the following proba- 
bilistic interpretation. [Here we repeat and generalize the discussion of Chap- 
ter 41.] 

The error function is interpreted as defining a noise model. f3Eu is the 
negative log likelihood: 

P(D\w,p,H) = ^^exp(-pE D ). (44.5) 

Thus, the use of the sum-squared error Ed (44.3) corresponds to an assump- 
tion of Gaussian noise on the target variables, and the parameter /3 defines a 
noise level = 1//3. 

Similarly the regularizer is interpreted in terms of a log prior probability 
distribution over the parameters: 

P(w\a,H) = -^—exp(-aE w ). (44.6) 
Z w {a) 

If E\y is quadratic as defined above, then the corresponding prior distribution 
is a Gaussian with variance cr^ = 1 /a. The probabilistic model H specifies 
the architecture A of the network, the likelihood (44.5), and the prior (44.6). 

The objective function M(w) then corresponds to the inference of the 
parameters w, given the data: 

P(w\D,a,P,H) = P{DM H) (447) 

= — exp(-M(w)). (44.8) 

The w found by (locally) minimizing M (w) is then interpreted as the (locally) 
most probable parameter vector, w MP . 

The interpretation of M(w) as a log probability adds little new at this 
stage. But new tools will emerge when we proceed to other inferences. First, 
though, let us establish the probabilistic interpretation of classification net- 
works, to which the same tools apply. 
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Binary classification networks 

If the targets t in a data set are binary classification labels (0, 1), it is natural 
to use a neural network whose output y(x; w, A) is bounded between and 1, 
and is interpreted as a probability P(t=l |x,w,.A). For example, a network 
with one hidden layer could be described by the feedforward equations (44.1) 
and (44.2), with f^ 2 \a) = 1/(1 + e~ a ). The error function f3E D is replaced by 
the negative log likelihood: 



G(w) = 



t (n) In ; w) + (1 - *(">) ln(l - 2/(x(") ; w)) 



(44.9) 



The total objective function is then M = G + aEw- Note that this includes 
no parameter (3 (because there is no Gaussian noise). 



Multi- class classification networks 

For a multi-class classification problem, we can represent the targets by a 
vector, t, in which a single element is set to 1, indicating the correct class, and 
all other elements are set to 0. In this case it is appropriate to use a 'softmax' 
network having coupled outputs which sum to one and are interpreted as 
class probabilities yi = P(tj = l | x, w,„4). The last part of equation (44.2) is 
replaced by: 

yi = Y^- (4410) 

i' 

The negative log likelihood in this case is 

G ( w ) = - EE^ )ln ^( x(n) ; w )- ( 44 - n ) 

n i 

As in the case of the regression network, the minimization of the objective 
function M(w) = G + aE\y corresponds to an inference of the form (44.8). A 
variety of useful results can be built on this interpretation. 



► 44.4 Benefits of the Bayesian approach to supervised feedforward 
neural networks 

From the statistical perspective, supervised neural networks are nothing more 
than nonlinear curve-fitting devices. Curve fitting is not a trivial task however. 
The effective complexity of an interpolating model is of crucial importance, 
as illustrated in figure 44.5. Consider a control parameter that influences the 
complexity of a model, for example a regularization constant a (weight decay 
parameter). As the control parameter is varied to increase the complexity of 
the model (descending from figure 44.5a-c and going from left to right across 
figure 44. 5d), the best fit to the training data that the model can achieve 
becomes increasingly good. However, the empirical performance of the model, 
the test error, first decreases then increases again. An over-complex model 
overfits the data and generalizes poorly. This problem may also complicate 
the choice of architecture in a multilayer perceptron, the radius of the basis 
functions in a radial basis function network, and the choice of the input vari- 
ables themselves in any multidimensional regression problem. Finding values 
for model control parameters that are appropriate for the data is therefore an 
important and non-trivial problem. 
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44.4: Benefits of the Bayesian approach to supervised feedforward neural 



The overfitting problem can be solved by using a Bayesian approach to 
control model complexity. 

If we give a probabilistic interpretation to the model, then we can evaluate 
the evidence for alternative values of the control parameters. As was explained 
in Chapter 28, over-complex models turn out to be less probable, and the 
evidence P(Data | Control Parameters) can be used as an objective function 
for optimization of model control parameters (figure 44. 5e). The setting of a 
that maximizes the evidence is displayed in figure 44.5b. 

Bayesian optimization of model control parameters has four important ad- 
vantages. (1) No 'test set' or 'validation set' is involved, so all available training 
data can be devoted to both model fitting and model comparison. (2) Reg- 
ularization constants can be optimized on-line, i.e., simultaneously with the 
optimization of ordinary model parameters. (3) The Bayesian objective func- 
tion is not noisy, in contrast to a cross-validation measure. (4) The gradient of 
the evidence with respect to the control parameters can be evaluated, making 
it possible to simultaneously optimize a large number of control parameters. 

Probabilistic modelling also handles uncertainty in a natural manner. It 
offers a unique prescription, marginalization, for incorporating uncertainty 
about parameters into predictions; this procedure yields better predictions, as 
we saw in Chapter 41. Figure 44.6 shows error bars on the predictions of a 
trained neural network. 

Implementation of Bayesian inference 

As was mentioned in Chapter 41, Bayesian inference for multilayer networks 
may be implemented by Monte Carlo sampling, or by deterministic methods 
employing Gaussian approximations (Neal, 1996; MacKay, 1992c). 



Figure 44.5. Optimization of 
model complexity. Panels (a-c) 
show a radial basis function model 
interpolating a simple data set 
with one input variable and one 
output variable. As the 
regularization constant is varied 
to increase the complexity of the 
model (from (a) to (c)), the 
intcrpolant is able to fit the 
training data increasingly well, 
but beyond a certain point the 
generalization ability (test error) 
of the model deteriorates. 
Probability theory allows us to 
optimize the control parameters 
without needing a test set. 




Figure 44.6. Error bars on the 
predictions of a trained regression 
network. The solid line gives the 
predictions of the best fit 
parameters of a multilayer 
perceptron trained on the data 
points. The error bars (dotted 
lines) arc those produced by the 
uncertainty of the parameters w. 
Notice that the error bars become 
larger where the data are sparse. 
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Within the Bayesian framework for data modelling, it is easy to improve 
our probabilistic models. For example, if we believe that some input variables 
in a problem may be irrelevant to the predicted quantity, but we don't know 
which, we can define a new model with multiple hyperparameters that captures 
the idea of uncertain input variable relevance (MacKay, 1994b; Neal, 1996; 
MacKay, 1995b); these models then infer automatically from the data which 
are the relevant input variables for a problem. 



► 44.5 Exercises 

Exercise 44.1. ^ How to measure a classifier's quality. You've just written a new 
classification algorithm and want to measure how well it performs on a test 
set, and compare it with other classifiers. What performance measure should 
you use? There are several standard answers. Let's assume the classifier gives 
an output y(x), where x is the input, which we won't discuss further, and that 
the true target value is t. In the simplest discussions of classifiers, both y and 
t are binary variables, but you might care to consider cases where y and t are 
more general objects also. 

The most widely used measure of performance on a test set is the error 
rate - the fraction of misclassifications made by the classifier. This measure 
forces the classifier to give a 0/1 output and ignores any additional information 
that the classifier might be able to offer - for example, an indication of the 
firmness of a prediction. Unfortunately, the error rate does not necessarily 
measure how informative a classifier's output is. Consider frequency tables 
showing the joint frequency of the 0/1 output of a classifier (horizontal axis), 
and the true 0/1 variable (vertical axis). The numbers that we'll show are 
percentages. The error rate e is the sum of the two off-diagonal numbers, 
which we could call the false positive rate e+ and the false negative rate e_. 

Of the following three classifiers, A and B have the same error rate of 10% 
and C has a greater error rate of 12%. 

Classifier A Classifier B Classifier C 



y 





1 


y 





1 


y 





1 


t 






t 






t 









90 








80 


10 





78 


12 


l 


10 





l 





10 


l 





10 



But clearly classifier A, which simply guesses that the outcome is for all 
cases, is conveying no information at all about t; whereas classifier B has an 
informative output: if y = then we are sure that t really is zero; and if y= 1 
then there is a 50% chance that t = l, as compared to the prior probability 
P(t= 1) = 0.1. Classifier C is slightly less informative than B, but it is still 
much more useful than the information-free classifier A. 

One way to improve on the error rate as a performance measure is to report 
the pair (e + , e_), the false positive error rate and the false negative error rate, 
which are (0, 0.1) and (0.1, 0) for classifiers A and B. It is especially important 
to distinguish between these two error probabilities in applications where the 
two sorts of error have different associated costs. However, there are a couple 
of problems with the 'error rate pair': 

• First, if I simply told you that classifier A has error rates (0, 0.1) and B 
has error rates (0.1, 0), it would not be immediately evident that classifier 
A is actually utterly worthless. Surely we should have a performance 
measure that gives the worst possible score to A! 



How common sense ranks the 
classifiers: 

(best) B > C > A (worst). 

How error rate ranks the 
classifiers: 

(best) A = B > C (worst). 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 



44.5: Exercises 



533 



• Second, if we turn to a multiple-class classification problem such as digit 
recognition, then the number of types of error increases from two to 
10 x 9 = 90 - one for each possible confusion of class t with t' . It would 
be nice to have some sensible way of collapsing these 90 numbers into a 
single rankable number that makes more sense than the error rate. 

Another reason for not liking the error rate is that it doesn't give a classifier 
credit for accurately specifying its uncertainty. Consider classifiers that have 
three outputs available, '0', T and a rejection class, '?', which indicates that 
the classifier is not sure. Consider classifiers D and E with the following 
frequency tables, in percentages: 



Classifier D 



Classifier E 



y 





? 


1 


y 





? 


1 


t 








t 











74 


10 


6 





78 


6 


6 


l 





1 


9 


l 





5 


5 



Both of these classifiers have (e+, e_, r) = (6%, 0%, 11%). But are they equally 
good classifiers? Compare classifier E with C. The two classifiers are equiva- 
lent. E is just C in disguise - we could make E by taking the output of C and 
tossing a coin when C says T in order to decide whether to give output T or 
'?'. So E is equal to C and thus inferior to B. Now compare D with B. Can 
you justify the suggestion that D is a more informative classifier than B, and 
thus is superior to E? Yet D and E have the same (e + , e_, r) scores. 

People often plot error-reject curves (also known as ROC curves; ROC 
stands for 'receiver operating characteristic') which show the total e = (e+ + 
e_) versus r as r is allowed to vary from to 1, and use these curves to 
compare classifiers (figure 44.7). [In the special case of binary classification 
problems, e + may be plotted versus e_ instead.] But as we have seen, error 
rates can be undiscerning performance measures. Does plotting one error rate 
as a function of another make this weakness of error rates go away? 

For this exercise, either construct an explicit example demonstrating that 
the error-reject curve, and the area under it, are not necessarily good ways to 
compare classifiers; or prove that they are. 

As a suggested alternative method for comparing classifiers, consider the 
mutual information between the output and the target, 



I(T;Y) = H(T) - H(T \ Y) 



Y J P{v)P{t\y)\og 



Pit) 

P(t\y) 7 



(44.12) 



which measures how many bits the classifier's output conveys about the target. 

Evaluate the mutual information for classifiers A-E above. 

Investigate this performance measure and discuss whether it is a useful 
one. Does it have practical drawbacks? 



Error rate 




Rejection rate 



Figure 44.7. An error-reject curve. 
Some people use the area under 
this curve as a measure of 
classifier quality. 
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About Chapter 45 



Feedforward neural networks such as multilayer perceptrons are popular tools 
for nonlinear regression and classification problems. From a Bayesian per- 
spective, a choice of a neural network model can be viewed as defining a prior 
probability distribution over nonlinear functions, and the neural network's 
learning process can be interpreted in terms of the posterior probability dis- 
tribution over the unknown function. (Some learning algorithms search for the 
function with maximum posterior probability and other Monte Carlo methods 
draw samples from this posterior probability.) 

In the limit of large but otherwise standard networks, Neal (1996) has 
shown that the prior distribution over nonlinear functions implied by the 
Bayesian neural network falls in a class of probability distributions known 
as Gaussian processes. The hyperparameters of the neural network model 
determine the characteristic lengthscales of the Gaussian process. Neal's ob- 
servation motivates the idea of discarding parameterized networks and working 
directly with Gaussian processes. Computations in which the parameters of 
the network are optimized are then replaced by simple matrix operations using 
the covariance matrix of the Gaussian process. 

In this chapter I will review work on this idea by Williams and Rasmussen 
(1996), Neal (1997b), Barber and Williams (1997) and Gibbs and MacKay 
(2000), and will assess whether, for supervised regression and classification 
tasks, the feedforward network has been superceded. 

> Exercise 45. 1.^ I regret that this chapter is rather dry. There's no simple 
explanatory examples in it, and few pictures. This exercise asks you to 
create interesting pictures to explain to yourself this chapter's ideas. 

Source code for computer demonstrations written in the free language 
octave is available at: 

http : //www . inference . phy . cam . ac . uk/mackay/ itprnn/ software . html. 

Radford Neal's software for Gaussian processes is available at: 
http : //www . cs . toronto . edu/~radf ord/. 
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After the publication of Rumelhart, Hinton and Williams's (1986) paper on 
supervised learning in neural networks there was a surge of interest in the 
empirical modelling of relationships in high-dimensional data using nonlinear 
parametric models such as multilayer perceptrons and radial basis functions. 
In the Bayesian interpretation of these modelling methods, a nonlinear func- 
tion j/(x) parameterized by parameters w is assumed to underlie the data 
{x'*', t n }n = i, and the adaptation of the model to the data corresponds to an 
inference of the function given the data. We will denote the set of input vectors 
by Xjv = {x( n >}n=i and the set of corresponding target values by the vector 
tjv = {t n }n=i- The inference of y(x) is described by the posterior probability 
distribution 



Of the two terms on the right-hand side, the first, P(tpj |?/(x),Xjv), is the 
probability of the target values given the function y(x), which in the case of 
regression problems is often assumed to be a separable Gaussian distribution; 
and the second term, P(y(x)), is the prior distribution on functions assumed 
by the model. This prior is implicit in the choice of parametric model and 
the choice of regularizers used during the model fitting. The prior typically 
specifies that the function y(x) is expected to be continuous and smooth, 
and has less high frequency power than low frequency power, but the precise 
meaning of the prior is somewhat obscured by the use of the parametric model. 

Now, for the prediction of future values of t, all that matters is the as- 
sumed prior P(y(x)) and the assumed noise model P(tjv | J/(x), Xjv) - the 
parameterization of the function y(x;w) is irrelevant. 

The idea of Gaussian process modelling is to place a prior P(y(x)) directly 
on the space of functions, without parameterizing y(x). The simplest type of 
prior over functions is called a Gaussian process. It can be thought of as the 
generalization of a Gaussian distribution over a finite vector space to a function 
space of infinite dimension. Just as a Gaussian distribution is fully specified 
by its mean and covariance matrix, a Gaussian process is specified by a mean 
and a covariance function. Here, the mean is a function of x (which we will 
often take to be the zero function), and the covariance is a function C(x, x') 
that expresses the expected covariance between the values of the function y 
at the points x and x'. The function y(x) in any one data modelling problem 
is assumed to be a single sample from this Gaussian distribution. Gaussian 
processes are already well established models for various spatial and temporal 
problems - for example, Brownian motion, Langevin processes and Wiener 
processes are all examples of Gaussian processes; Kalman filters, widely used 




Gaussian Processes 



P{y(x)\t N ,X N ) = 



P(t N \y(x),X N )P(y(x)) 
P(t N \X N ) 



(45.1) 
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to model speech waveforms, also correspond to Gaussian process models; the 
method of 'kriging' in geostatistics is a Gaussian process regression method. 



Reservations about Gaussian processes 

It might be thought that it is not possible to reproduce the interesting prop- 
erties of neural network interpolation methods with something so simple as a 
Gaussian distribution, but as we shall now see, many popular nonlinear inter- 
polation methods are equivalent to particular Gaussian processes. (I use the 
term 'interpolation' to cover both the problem of 'regression' - fitting a curve 
through noisy data - and the task of fitting an interpolant that passes exactly 
through the given data points.) 

It might also be thought that the computational complexity of inference 
when we work with priors over infinite-dimensional function spaces might be 
infinitely large. But by concentrating on the joint probability distribution of 
the observed data and the quantities we wish to predict, it is possible to make 
predictions with resources that scale as polynomial functions of N, the number 
of data points. 



► 45.1 Standard methods for nonlinear regression 

The problem 

We are given N data points Xjv,tyv = {x("), t n }^ =1 . The inputs x are vec- 
tors of some fixed input dimension I. The targets t are either real numbers, 
in which case the task will be a regression or interpolation task, or they are 
categorical variables, for example t € {0, 1}, in which case the task is a clas- 
sification task. We will concentrate on the case of regression for the time 
being. 

Assuming that a function y(x) underlies the observed data, the task is to 
infer the function from the given data, and predict the function's value - or 
the value of the observation t^+i - at a new point x' w+1 '. 



Parametric approaches to the problem 

In a parametric approach to regression we express the unknown function y(x) 
in terms of a nonlinear function y(x; w) parameterized by parameters w. 

Example 45.2. Fixed basis functions. Using a set of basis functions {<f>h{ x )}h=i, 
we can write 

H 

j/(x;w) = J2w h cf> h {x). (45.2) 

h=i 

If the basis functions are nonlinear functions of x such as radial basis 
functions centred at fixed points {ch}^ =1 , 



(j) h (x) = exp 



(x - c h ) 



21 



2r 2 



(45.3) 



then y(x; w) is a nonlinear function of x; however, since the dependence 
of y on the parameters w is linear, we might sometimes refer to this as 
a 'linear' model. In neural network terms, this model is like a multilayer 
network whose connections from the input layer to the nonlinear hidden 
layer are fixed; only the output weights w are adaptive. 

Other possible sets of fixed basis functions include polynomials such as 
0/j(x) = x^x'j where p and q are integer powers that depend on h. 
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Example 45.3. Adaptive basis functions. Alternatively, we might make a func- 
tion y(x) from basis functions that depend on additional parameters 
included in the vector w. In a two-layer feedforward neural network 
with nonlinear hidden units and a linear output, the function can be 
written 

y(x; w) = w h ] tanh ( E w hh* + wio) + 4 2) (45-4) 

h=l \i=l ) 



where / is the dimensionality of the input space and the weight vector 
w consists of the input weights {w 
the output weights {w^} and the c 
dependence of y on w is nonlinear. 



w consists of the input weights {w^}, the hidden unit biases {w^}, 
the output weights {w^} and the output bias . In this model, the 



Having chosen the parameterization, we then infer the function y(x; w) by 
inferring the parameters w. The posterior probability of the parameters is 

The factor P(tjv |w,Xjv) states the probability of the observed data points 
when the parameters w (and hence, the function y) are known. This proba- 
bility distribution is often taken to be a separable Gaussian, each data point 
t n differing from the underlying value y(x.^; w) by additive noise. The factor 
P(w) specifies the prior probability distribution of the parameters. This too 
is often taken to be a separable Gaussian distribution. If the dependence of y 
on w is nonlinear the posterior distribution P(w 1 1 n, Xjv) is in general not a 
Gaussian distribution. 

The inference can be implemented in various ways. In the Laplace method, 
we minimize an objective function 

M(w) = -ln[P(tjv |w,Xjv)P(w)] (45.6) 

with respect to w, locating the locally most probable parameters, then use the 
curvature of M, d 2 M(w)/dwidwj, to define error bars on w. Alternatively we 
can use more general Markov chain Monte Carlo techniques to create samples 
from the posterior distribution P(w | tjv, Xjy). 

Having obtained one of these representations of the inference of w given 
the data, predictions are then made by marginalizing over the parameters: 

P(t N+1 \t Nl X N+1 ) = Jd H w P(t N+1 \w,^ N +^)P(w\t N ,X N ). (45.7) 

If we have a Gaussian representation of the posterior P(w 1 1 jv, Xjv), then this 
integral can typically be evaluated directly. In the alternative Monte Carlo 
approach, which generates R samples w( r ) that are intended to be samples 
from the posterior distribution P(w | t^r,Xjv), we approximate the predictive 
distribution by 



1 R 

P(t N+1 \t N ,X N+1 ) --VP(^ +1 |w«,x (Ar+1) ). (45.8) 
R i 
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Nonparametric approaches. 

In nonparametric methods, predictions are obtained without explicitly pa- 
rameterizing the unknown function j/(x); y(x) lives in the infinite-dimensional 
space of all continuous functions of x. One well known nonparametric ap- 
proach to the regression problem is the spline smoothing method (Kimeldorf 
and Wahba, 1970). A spline solution to a one-dimensional regression problem 
can be described as follows: we define the estimator of j/(x) to be the function 
y(x) that minimizes the functional 

i N 1 r 

M(y(x)) = -PJ2(y( x{n) )- t n) 2 +2* I dx[y^(x)} 2 , (45.9) 

n=l 

where 

yip) i s 

the pth derivative of y and p is a positive number. If p is set to 
2 then the resulting function j/(x) is a cubic spline, that is, a piecewise cubic 
function that has 'knots' - discontinuities in its second derivative - at the data 
points {x( n '}. 

This estimation method can be interpreted as a Bayesian method by iden- 
tifying the prior for the function y(x) as: 

\nP(y(x) | a) = ~ a J dx \y (p) {x)} 2 + const, (45.10) 

and the probability of the data measurements tjv = {tn}n=i assuming inde- 
pendent Gaussian noise as: 

1 N 

\nP(t N | y(x),0) = --/3V(|/(x (n) ) - t n ) 2 + const. (45.11) 

1 n=l 

[The constants in equations (45.10) and (45.11) are functions of a and (3 re- 
spectively. Strictly the prior (45. 10) is improper since addition of an arbitrary 
polynomial of degree (p — 1) to y(x) is not constrained. This impropriety is 
easily rectified by the addition of (p— 1) appropriate terms to (45.10).] Given 
this interpretation of the functions in equation (45.9), M(y(x)) is equal to mi- 
nus the log of the posterior probability P{y{x) |tjv,a,/?), within an additive 
constant, and the splines estimation procedure can be interpreted as yielding 
a Bayesian MAP estimate. The Bayesian perspective allows us additionally 
to put error bars on the splines estimate and to draw typical samples from 
the posterior distribution, and it gives an automatic method for inferring the 
hyperparameters a and (3. 



Comments 

Splines priors are Gaussian processes 

The prior distribution defined in equation (45.10) is our first example of a 
Gaussian process. Throwing mathematical precision to the winds, a Gaussian 
process can be defined as a probability distribution on a space of functions 
y(x) that can be written in the form 



P(y(x)\n(x),A) 



1 



■ exp 



1 



(y(x) - n(x)) T A(y(x) - n{x)) 



(45.12) 



where n(x) is the mean function and A is a linear operator, and where the inner 
product of two functions y(x) T z(x) is defined by, for example, J dx y{x)z(x). 
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Here, if we denote by D the linear operator that maps y(x) to the derivative 
of y(x), we can write equation (45.10) as 

If 1 

In P(y(x) | a) = — — a / da; [D p y(x)} 2 + const = — - y(x) J Ay(x) + const, 

(45.13) 

which has the same form as equation (45.12) with fj,(x) = 0, and A = [D P ] T D P . 

In order for the prior in equation (45.12) to be a proper prior, A must be a 
positive definite operator, i.e., one satisfying y(x) J Ay(x) > for all functions 
y(x) other than y(x) = 0. 

Splines can be written as parametric models 

Splines may be written in terms of an infinite set of fixed basis functions, as in 
equation (45.2), as follows. First rescale the x axis so that the interval (0,27r) 
is much wider than the range of x values of interest. Let the basis functions 
be a Fourier set {cos hx,sinhx, h = 0, 1, 2, . . .}, so the function is 

oo oo 

V( x ) = w h(cos) cos(hx) + £ Wfc(sin) sin(hx). (45.14) 

h=0 h=l 

Use the regularizer 

oo 1 oo 1 

EwM = £ ^wl {cos) + £ -h1wl {sin) (45.15) 

h=Q h=l 

to define a Gaussian prior on w, 

^(w | a) = — i— exp(-aE w ). (45.16) 

If p = 2 then we have the cubic splines regularizer E\y(w)= J y^(x) 2 dx, as 
in equation (45.9); if p=l we have the regularizer £V(w)= J y^\x) 2 dx, 
etc. (To make the prior proper we must add an extra regularizer on the 
term wo(cos)-) Thus in terms of the prior P(y(x)) there is no fundamental 
difference between the 'nonparametric' splines approach and other parametric 
approaches. 



Representation is irrelevant for prediction 

From the point of view of prediction at least, there are two objects of inter- 
est. The first is the conditional distribution P{t^ + \ |tjv,Xjv+i) defined in 
equation (45.7). The other object of interest, should we wish to compare one 
model with others, is the joint probability of all the observed data given the 
model, the evidence P(tjv | Xyy), which appeared as the normalizing constant 
in equation (45.5). Neither of these quantities makes any reference to the rep- 
resentation of the unknown function y{x). So at the end of the day, our choice 
of representation is irrelevant. 

The question we now address is, in the case of popular parametric models, 
what form do these two quantities take? We will see that for standard models 
with fixed basis functions and Gaussian distributions on the unknown parame- 
ters, the joint probability of all the observed data given the model, P(t N \ X w ), 
is a multivariate Gaussian distribution with mean zero and with a covariance 
matrix determined by the basis functions; this implies that the conditional 
distribution P(t]y + i |tAr,XAr + i) is also a Gaussian distribution, whose mean 
depends linearly on the values of the targets tjy. Standard parametric models 
are simple examples of Gaussian processes. 
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► 45.2 From parametric models to Gaussian processes 

Linear models 

Let us consider a regression problem using H fixed basis functions, for example 
one-dimensional radial basis functions as defined in equation (45.3). 

Let us assume that a list of N input points {x*-™'} has been specified and 
define the N x H matrix R to be the matrix of values of the basis functions 
{<t>h(x)}h=i at tne points {x„}, 

R nh = <M(xW). (45.17) 

We define the vector yjv to be the vector of values of y(x) at the N points, 

y n = J2 R nhW h . (45.18) 

h 

If the prior distribution of w is Gaussian with zero mean, 

P(w) = Normal(w; 0, a 2 w l), (45.19) 

then y, being a linear function of w, is also Gaussian distributed, with mean 
zero. The covariance matrix of y is 

Q = ( yy T ) = (Rww T R T ) = R (ww T ) R T (45.20) 
= o-2,RR T . (45.21) 

So the prior distribution of y is: 

P(y) = Normal(y; 0, Q) = Normal(y; 0, a 2 w RR T ). (45.22) 

This result, that the vector of N function values y has a Gaussian distribu- 
tion, is true for any selected points Xjy. This is the defining property of a 
Gaussian process. The probability distribution of a function y(x) is a Gaus- 
sian process if for any finite selection of points the density 
P(y(x' 1 '), y(x^ 2 ) ),..., y(x.( N ))) is a Gaussian. 

Now, if the number of basis functions H is smaller than the number of 
data points N, then the matrix Q will not have full rank. In this case the 
probability distribution of y might be thought of as a flat elliptical pancake 
confined to an i/-dimensional subspace in the ^-dimensional space in which 
y lives. 

What about the target values? If each target t n is assumed to differ by 
additive Gaussian noise of variance a 2 from the corresponding function value 
y n then t also has a Gaussian prior distribution, 

P(t) = Normal(t;0,Q + cr 2 I). (45.23) 

We will denote the covariance matrix of t by C: 

C = Q + a 2 ! = o* RR T + all. (45.24) 

Whether or not Q has full rank, the covariance matrix C has full rank since 
all is full rank. 

What does the covariance matrix Q look like? In general, the (n, n') entry 
of Q is 

Q nn , = [a 2 w RRXn' = <4X>(* (n) )Mx< n '>) (45.25) 

h 
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45.2: From parametric models to Gaussian processes 
and the (n, n') entry of C is 

Cnn> = <T= X>(x< n >)0 fc (x< n '>) + Snn'vl (45.26) 
h 

where 5 nn i = l\in = n' and otherwise. 



Example 45.4. Let's take as an example a one-dimensional case, with radial 
basis functions. The expression for Q n n' becomes simplest if we assume we 
have uniformly- spaced basis functions with the basis function labelled h cen- 
tred on the point x = h, and take the limit H — > oo, so that the sum over 
h becomes an integral; to avoid having a covariance that diverges with H, 
we had better make cr 2 scale as S/(AH), where AH is the number of basis 
functions per unit length of the x-axis, and S is a constant; then 



Qr, 



Jh n 



"mm 
f' ^max 

S I dh exp 



dhM^ n) )M^ n>) ) 

(x^ - h) 2 



2r 2 



exp 



(x^ - h) 2 
2^2 



(45.27) 



. (45.28) 



If we let the limits of integration be ±oo, we can solve this integral: 

(a;(n') _a;(n))2" 



Qn 



'irr 2 S exp 



4r 2 



(45.29) 



We are arriving at a new perspective on the interpolation problem. Instead of 
specifying the prior distribution on functions in terms of basis functions and 
priors on parameters, the prior can be summarized simply by a covariance 
function, 

C{x {n \x^) = 6»iexp 

where we have given a new name, 9\, to the constant out front. 

Generalizing from this particular case, a vista of interpolation methods 
opens up. Given any valid covariance function C(x, x') - we'll discuss in 
a moment what 'valid' means - we can define the covariance matrix for N 
function values at locations Xjv to be the matrix Q given by 

Q nn , = C(x< n >,x< n '>) (45.31) 



( x {n')- x (n)y 



4 r 2 



(45.30) 



and the covariance matrix for N corresponding target values, assuming Gaus- 
sian noise, to be the matrix C given by 

C„ tl - = C(xW,xM)+4»'. (45.32) 
In conclusion, the prior probability of the N target values t in the data set is: 
P(t) = Normal(t; 0, C) = ^e-^ ^. (45.33) 

Samples from this Gaussian process and a few other simple Gaussian processes 
are displayed in figure 45.1. 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 



542 



3.0 




-2.0 1 ■ 1 ■ ^— 1 ■ 

-3.0 -1.0 1.0 3.0 5.0 

(a)2exp(-w) 



4.0 



2.0 



- 0.0 



-2.0 




-4.0 1 1 1 1 

-3.0 -1.0 1.0 3.0 5.0 

(c)2exp(--%^)) 




-4.0 1 ■ 1 ■ 1 ■ 1 ■ 

-3.0 -1.0 1.0 3.0 5.0 

( b ) 2 -p(-|^) 



6.0 




-2.0 



-4.0 1 1 1 1 

-3.0 -1.0 1.0 3.0 5.0 

(d)2cxp(-|=§y) W 



45 — Gaussian Processes 

Figure 45.1. Samples drawn from 
Gaussian process priors. Each 
panel shows two functions drawn 
from a Gaussian process prior. 
The four corresponding covariance 
functions are given below each 
plot. The decrease in lengthscale 
from (a) to (b) produces more 
rapidly fluctuating functions. The 
periodic properties of the 
covariance function in (c) can be 
seen. The covariance function in 
(d) contains the non-stationary 
term xx' corresponding to the 
covariance of a straight line, so 
that typical functions include 
linear trends. From Gibbs (1997). 



Multilayer neural networks and Gaussian processes 

Figures 44.3 and 44.2 show some random samples from the prior distribution 
over functions defined by a selection of standard multilayer perceptrons with 
large numbers of hidden units. Those samples don't seem a million miles away 
from the Gaussian process samples of figure 45.1. And indeed Neal (1996) 
showed that the properties of a neural network with one hidden layer (as 
in equation (45.4)) converge to those of a Gaussian process as the number of 
hidden neurons tends to infinity, if standard 'weight decay' priors are assumed. 
The covariance function of this Gaussian process depends on the details of the 
priors assumed for the weights in the network and the activation functions of 
the hidden units. 



► 45.3 Using a given Gaussian process model in regression 

We have spent some time talking about priors. We now return to our data 
and the problem of prediction. How do we make predictions with a Gaussian 
process? 

Having formed the covariance matrix C defined in equation (45.32) our task 
is to infer t^+i given the observed vector tjy. The joint density P(ijv+i,tjv) 
is a Gaussian; so the conditional distribution 

nt N+ ,\t N )= P{t ^ ) N) (45.34) 

is also a Gaussian. We now distinguish between different sizes of covariance 
matrix C with a subscript, such that C^r+i is the (7V+1) x (N + l) covariance 
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45.4: Examples of covariance functions 

,tN+i) J - We define submatrices of Cat + i 



matrix for the vector tjv+i = (h 
as follows: 



k 



(45.35) 



The posterior distribution (45.34) is given by 



P(tN+i I tjv) oc exp 



2 [*JV t N+ i] C N X +1 



tjv 

tN+l 



(45.36) 



We can evaluate the mean and standard deviation of the posterior distribution 
of ijv+i by brute force inversion of Cjv+i- There is a more elegant expression 
for the predictive distribution, however, which is useful whenever predictions 
are to be made at a number of new points on the basis of the data set of size 
N. We can write in terms of Cat and C^ 1 using the partitioned inverse 

equations (Barnett, 1979): 



r- 1 



M 



m 

m 



where 



m = 


(k - k T C^k) 


m = 


-mC^k 


M = 


+ — mm 7 . 

m 



When we substitute this matrix into equation (45.36) we find 



P{t N +\ | tjv) = ]7 ex P 



2a? 

tN+l 



where 



tN+l 

tN+l 



h T C N l t N 
k - k T C^k. 



(45.37) 



(45.38) 
(45.39) 

(45.40) 



(45.41) 



(45.42) 
(45.43) 



The predictive mean at the new point is given by tjv+i an d a t N+1 defines the 
error bars on this prediction. Notice that we do not need to invert C N+ i in 
order to make predictions at x^ w+1 '. Only Cjv needs to be inverted. Thus 
Gaussian processes allow one to implement a model with a number of basis 
functions H much larger than the number of data points TV", with the com- 
putational requirement being of order A^ 3 , independent of H . [We'll discuss 
ways of reducing this cost later.] 

The predictions produced by a Gaussian process depend entirely on the 
covariance matrix C. We now discuss the sorts of covariance functions one 
might choose to define C, and how we can automate the selection of the 
covariance function in response to data. 



► 45.4 Examples of covariance functions 

The only constraint on our choice of covariance function is that it must gen- 
erate a non-negative-definite covariance matrix for any set of points {^ n }n=i- 
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We will denote the parameters of a covariance function by 0. The covariance 
matrix of t has entries given by 

C mn = C(xW , xW ; 6) + 6 mn N{^n) . e) (45.44) 

where C is the covariance function and M is a noise model which might be 
stationary or spatially varying, for example, 

( #3 for input-independent noise 

A/"(x, 0) exj p(^2j=i /^j ( Aj( x )^ for input-dependent noise. (45.45) 

The continuity properties of C determine the continuity properties of typical 
samples from the Gaussian process prior. An encyclopaedic paper on Gaus- 
sian processes giving many valid covariance functions has been written by 
Abrahamsen (1997). 

Stationary covariance functions 

A stationary covariance function is one that is translation invariant in that it 
satisfies 

C(x,x';0) = £>(x-x';0) (45.46) 

for some function D, i.e., the covariance is a function of separation only, also 
known as the autocovariance function. If additionally C depends only on the 
magnitude of the distance between x and x' then the covariance function is said 
to be homogeneous. Stationary covariance functions may also be described in 
terms of the Fourier transform of the function D, which is known as the power 
spectrum of the Gaussian process. This Fourier transform is necessarily a 
positive function of frequency. One way of constructing a valid stationary 
covariance function is to invent a positive function of frequency and define D 
to be its inverse Fourier transform. 

Example 45.5. Let the power spectrum be a Gaussian function of frequency. 
Since the Fourier transform of a Gaussian is a Gaussian, the autoco- 
variance function corresponding to this power spectrum is a Gaussian 
function of separation. This argument rederives the covariance function 
we derived at equation (45.30). 

Generalizing slightly, a popular form for C with hyperparameters 6 = 

{e 1 ,e 2 ,{r i }) is 



C(x,x';0) = 6qexp 



1 \ - (Xi - xtf 



2 4- rf 



(45.47) 



x is an /-dimensional vector and is a lengthscale associated with input Xi, the 
lengthscale in direction i on which y is expected to vary significantly. A very 
large lengthscale means that y is expected to be essentially a constant function 
of that input. Such an input could be said to be irrelevant, as in the automatic 
relevance determination method for neural networks (MacKay, 1994a; Neal, 
1996). The 6\ hyperparameter defines the vertical scale of variations of a 
typical function. The 62 hyperparameter allows the whole function to be 
offset away from zero by some unknown constant - to understand this term, 
examine equation (45.25) and consider the basis function 0(x) = 1. 
Another stationary covariance function is 



C(x,x') = exp(-\x - x'\») 0<v<2. 



(45.48) 
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Figure 45.2. Multimodal 
likelihood functions for Gaussian 
processes. A data set of five 
points is modelled with the simple 
covariance function (45.47), with 
one hyperparameter #3 controlling 
the noise variance. Panels a and b 
show the most probable 
interpolant and its la error bars 
when the hyperparameters are 
set to two different values that 
(locally) maximize the likelihood 
P(t N \X N ,G): (a) n =0.95, 
3 = 0.0; (b) n = 3.5, 3 = 3.0. 
Panel c shows a contour plot of 
the likelihood as a function of r\ 
and 3 , with the two maxima 
shown by crosses. From Gibbs 
(1997). 



For v = 2, this is a special case of the previous covariance function. For 
v £ (1,2), the typical functions from this prior are smooth but not analytic 
functions. For v < 1 typical functions are continuous but not smooth. 

A covariance function that models a function that is periodic with known 
period in the i th input direction is 



C(x,x';0) = 6>iexp 



2 ^ 

i 



'sin (Z-( Xi 




(45.49) 



Figure 45.1 shows some random samples drawn from Gaussian processes 
with a variety of different covariance functions. 



Nonstationary covariance functions 

The simplest nonstationary covariance function is the one corresponding to a 
linear trend. Consider the plane y(x) = J2i w i x i + c - If tne i w i} an d c nave 
Gaussian distributions with zero mean and variances a\, and <j\ respectively 
then the plane has a covariance function 

/ 

Ciin(x,x';{a w ,o- c }) = ^ o 1 w x i ^ i + o\. (45.50) 

i=i 

An example of random sample functions incorporating the linear term can be 
seen in figure 45. Id. 



► 45.5 Adaptation of Gaussian process models 



Let us assume that a form of covariance function has been chosen, but that it 
depends on undetermined hyperparameters 6. We would like to 'learn' these 
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hyperparameters from the data. This learning process is equivalent to the 
inference of the hyperparameters of a neural network, for example, weight 
decay hyperparameters. It is a complexity-control problem, one that is solved 
nicely by the Bayesian Occam's razor. 

Ideally we would like to define a prior distribution on the hyperparameters 
and integrate over them in order to make our predictions, i.e., we would like 
to find 

P(t N+1 \x N+1 ,V) = J P(t N+1 \x N+1 ,0,V)P(0\V)dO. (45.51) 

But this integral is usually intractable. There are two approaches we can take. 

1. We can approximate the integral by using the most probable values of 
hyperparameters . 

P(t N+1 \x N+1 ,V) ~ P(t N+ i\x N+ i,V,6 MP ) (45.52) 

2. Or we can perform the integration over numerically using Monte Carlo 
methods (Williams and Rasmussen, 1996; Neal, 1997b). 

Either of these approaches is implemented most efficiently if the gradient 
of the posterior probability of can be evaluated. 

Gradient 

The posterior probability of 8 is 

P(0\V) oc P{t N \X N ,0)P(d). (45.53) 
The log of the first term (the evidence for the hyperparameters) is 

In P{t N \X N ,6) = -^lndetCw - ^t&C^tjv - y ln27r, (45.54) 
and its derivative with respect to a hyperparameter 9 is 

l e lnP(t N \X N ,8) = -ilrace (c ff ^) + I^C^C^t*. 

(45.55) 



Comments 

Assuming that finding the derivatives of the priors is straightforward, we can 
now search for 9 MP . However there are two problems that we need to be aware 
of. Firstly, as illustrated in figure 45.2, the evidence may be multimodal. 
Suitable priors and sensible optimization strategies often eliminate poor op- 
tima. Secondly and perhaps most importantly the evaluation of the gradi- 
ent of the log likelihood requires the evaluation of C^ 1 . Any exact inversion 
method (such as Cholesky decomposition, LU decomposition or Gauss-Jordan 
elimination) has an associated computational cost that is of order TV 3 and so 
calculating gradients becomes time consuming for large training data sets. Ap- 
proximate methods for implementing the predictions (equations (45.42) and 
(45.43)) and gradient computation (equation (45.55)) are an active research 
area. One approach based on the ideas of Skilling (1993) makes approxima- 
tions to C _1 t and Trace C -1 using iterative methods with cost 0(N 2 ) (Gibbs 
and MacKay, 1996; Gibbs, 1997). Further references on this topic are given 
at the end of the chapter. 
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45.6: Classification 

► 45.6 Classification 

Gaussian processes can be integrated into classification modelling once we 
identify a variable that can sensibly be given a Gaussian process prior. 

In a binary classification problem, we can define a quantity a n = a(x^ n >) 
such that the probability that the class is 1 rather than is 

P(t n = l\a n ) = 1 (45.56) 
1 + e a ™ 

Large positive values of a correspond to probabilities close to one; large neg- 
ative values of a define probabilities that are close to zero. In a classifica- 
tion problem, we typically intend that the probability P(t n = l) should be a 
smoothly varying function of x. We can embody this prior belief by defining 
a(x) to have a Gaussian process prior. 

Implementation 

It is not so easy to perform inferences and adapt the Gaussian process model 
to data in a classification model as in regression problems because the like- 
lihood function (45.56) is not a Gaussian function of a n . So the posterior 
distribution of a given some observations t is not Gaussian and the normal- 
ization constant P(tjy | Xjy) cannot be written down analytically. Barber and 
Williams (1997) have implemented classifiers based on Gaussian process priors 
using Laplace approximations (Chapter 27). Neal (1997b) has implemented a 
Monte Carlo approach to implementing a Gaussian process classifier. Gibbs 
and MacKay (2000) have implemented another cheap and cheerful approach 
based on the methods of Jaakkola and Jordan (section 33.8). In this varia- 
tional Gaussian process classifier, we obtain tractable upper and lower bounds 
for the unnormalized posterior density over a, P(t^ |a)P(a). These bounds 
are parameterized by variational parameters which are adjusted in order to 
obtain the tightest possible fit. Using normalized versions of the optimized 
bounds we then compute approximations to the predictive distributions. 

Multi-class classification problems can also be solved with Monte Carlo 
methods (Neal, 1997b) and variational methods (Gibbs, 1997). 

► 45.7 Discussion 

Gaussian processes are moderately simple to implement and use. Because very 
few parameters of the model need to be determined by hand (generally only 
the priors on the hyperparameters), Gaussian processes are useful tools for 
automated tasks where fine tuning for each problem is not possible. We do 
not appear to sacrifice any performance for this simplicity. 

It is easy to construct Gaussian processes that have particular desired 
properties; for example we can make a straightforward automatic relevance 
determination model. 

One obvious problem with Gaussian processes is the computational cost 
associated with inverting an N x N matrix. The cost of direct methods of 
inversion becomes prohibitive when the number of data points N is greater 
than about 1000. 

Have we thrown the baby out with the bath water? 

According to the hype of 1987, neural networks were meant to be intelligent 
models that discovered features and patterns in data. Gaussian processes in 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 

548 45 — Gaussian Processes 



contrast are simply smoothing devices. How can Gaussian processes possi- 
bly replace neural networks? Were neural networks over-hyped, or have we 
underestimated the power of smoothing methods? 

I think both these propositions are true. The success of Gaussian processes 
shows that many real-world data modelling problems are perfectly well solved 
by sensible smoothing methods. The most interesting problems, the task of 
feature discovery for example, are not ones that Gaussian processes will solve. 
But maybe multilayer perceptrons can't solve them either. Perhaps a fresh 
start is needed, approaching the problem of machine learning from a paradigm 
different from the supervised feedforward mapping. 

Further reading 

The study of Gaussian processes for regression is far from new. Time series 
analysis was being performed by the astronomer T.N. Thiele using Gaussian 
processes in 1880 (Lauritzen, 1981). In the 1940s, Wiener-Kolmogorov pre- 
diction theory was introduced for prediction of trajectories of military targets 
(Wiener, 1948). Within the geostatistics field, Matheron (1963) proposed a 
framework for regression using optimal linear estimators which he called 'krig- 
ing' after D.G. Krige, a South African mining engineer. This framework is 
identical to the Gaussian process approach to regression. Kriging has been 
developed considerably in the last thirty years (see Cressie (1993) for a re- 
view) including several Bayesian treatments (Omre, 1987; Kitanidis, 1986). 
However the geostatistics approach to the Gaussian process model has con- 
centrated mainly on low-dimensional problems and has largely ignored any 
probabilistic interpretation of the model. Kalman filters are widely used to 
implement inferences for stationary one-dimensional Gaussian processes, and 
are popular models for speech and music modelling (Bar-Shalom and Fort- 
mann, 1988). Generalized radial basis functions (Poggio and Girosi, 1989), 
ARMA models (Wahba, 1990) and variable metric kernel methods (Lowe, 
1995) are all closely related to Gaussian processes. See also O'Hagan (1978). 

The idea of replacing supervised neural networks by Gaussian processes 
was first explored by Williams and Rasmussen (1996) and Neal (1997b). A 
thorough comparison of Gaussian processes with other methods such as neural 
networks and MARS was made by Rasmussen (1996). Methods for reducing 
the complexity of data modelling with Gaussian processes remain an active 
research area (Poggio and Girosi, 1990; Luo and Wahba, 1997; Tresp, 2000; 
Williams and Seeger, 2001; Smola and Bartlett, 2001; Rasmussen, 2002; Seeger 
et al, 2003; Opper and Winther, 2000). 

A longer review of Gaussian processes is in (MacKay, 1998b). A review 
paper on regression with complexity control using hierarchical Bayesian models 
is (MacKay, 1992a). 

Gaussian processes and support vector learning machines (Scholkopf et al, 
1995; Vapnik, 1995) have a lot in common. Both are kernel-based predictors, 
the kernel being another name for the covariance function. A Bayesian version 
of support vectors, exploiting this connection, can be found in (Chu et al, 
2001; Chu et al, 2002; Chu et al, 2003b; Chu et al, 2003a). 
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46 



Deconvolution 



► 46.1 Traditional image reconstruction methods 

Optimal linear filters 

In many imaging problems, the data measurements {d n } are linearly related 
to the underlying image f : 

d n = E Rnkfk + n n . (46.1) 

k 

The vector n denotes the inevitable noise that corrupts real data. In the case 
of a camera which produces a blurred picture, the vector f denotes the true 
image, d denotes the blurred and noisy picture, and the linear operator R 
is a convolution defined by the point spread function of the camera. In this 
special case, the true image and the data vector reside in the same space; 
but it is important to maintain a distinction between them. We will use the 
subscript n = 1, ...,7V to run over data measurements, and the subscripts 
k,k' = 1, . . . , K to run over image pixels. 

One might speculate that since the blur was created by a linear operation, 
then perhaps it might be deblurred by another linear operation. We can derive 
the optimal linear filter in two ways. 



Bayesian derivation 

We assume that the linear operator R is known, and that the noise n is 
Gaussian and independent, with a known standard deviation a v . 

P(d \f,a v ,H) = {2 J 2)N/2 exp (- E (d n - J2 k Rnkfkf/ (2^ . (46.2) 

We assume that the prior probability of the image is also Gaussian, with a 
scale parameter <jf. 

P(i \a f ,H)= ^\\k,2 ex P I " E hCwfU ( 2 °f) ) • ( 46 - 3 ) 



k,k' 



If we assume no correlations among the pixels then the symmetric, full rank 
matrix C is equal to the identity matrix I. The more sophisticated 'intrinsic 
correlation function' model uses C = [GG T ] _1 , where G is a convolution that 
takes us from an imaginary 'hidden' image, which is uncorrelated, to the real 
correlated image. The intrinsic correlation function should not be confused 
with the point spread function R which defines the image-to-data mapping. 
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A zero-mean Gaussian prior is clearly a poor assumption if it is known that 
all elements of the image f are positive, but let us proceed. We can now write 
down the posterior probability of an image f given the data d. 



In words, 



P(f\d,a„,(T f ,H) 



Posterior 



P(d|f,^,^)P(f \<Tf,H)) 

P(d\<r v ,a f ,H) 
Likelihood x Prior 



(46.4) 



(46.5) 



Evidence 

The 'evidence' P(d | cr u , crj, Ti) is the normalizing constant for this posterior 
distribution. Here it is unimportant, but it is used in a more sophisticated 
analysis to compare, for example, different values of o v and a/, or different 
point spread functions R. 

Since the posterior distribution is the product of two Gaussian functions of 
f , it is also a Gaussian, and can therefore be summarized by its mean, which 
is also the most probable image, f M p, and its covariance matrix: 



] f | d = [-VVl0gP(f|d,<7 v ,<7 /) W)] 



-1 



(46.6) 



which defines the joint error bars on f . In this equation, the symbol V denotes 
differentiation with respect to the image parameters f . We can find f MP by 
differentiating the log of the posterior, and solving for the derivative being 
zero. We obtain: 



R T R + — ^-C 



-l 



R T d. 



(46.7) 



The operator 



RR 



R T is called the optimal linear filter. When 



the term %C can be neglected, the optimal linear filter is the pseudoinverse 

a f 

[R T R] _1 R T . The term %C regularizes this ill-conditioned inverse. 
The optimal linear filter can also be manipulated into the form: 



Optimal linear filter = C R T 



RC *R T 



(46.8) 



Minimum square error derivation 

The non-Bayesian derivation of the optimal linear filter starts by assuming 
that we will 'estimate' the true image f by a linear function of the data: 



f = Wd. 



(46.9) 



The linear operator W is then 'optimized' by minimizing the expected sum- 
squared error between f and the unknown true image . In the following equa- 
tions, summations over repeated indices k, k' , n are implicit. The expectation 
(•) is over both the statistics of the random variables {n„}, and the ensemble 
of images f which we expect to bump into. We assume that the noise is zero 
mean and uncorrelated to second order with itself and everything else, with 
{n n n n i) = ol5 nn >. 



(E) 



{w kn d n - f k y 



{WknRnjfj — fkY 



1. 



:W kn W kn a v . 



(46.10) 
(46.11) 
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46.1: Traditional image reconstruction methods 

Differentiating with respect to W, and introducing F = (fj'fj) (cf. crjC- 1 in 
the Bayesian derivation above), we find that the optimal linear filter is: 

W opt = FR7 [RFR7 + o*i\ _1 . (46.12) 

If we identify F = cr^C -1 , we obtain the optimal linear filter (46.8) of the 
Bayesian derivation. The ad hoc assumptions made in this derivation were the 
choice of a quadratic error measure, and the decision to use a linear estimator. 
It is interesting that without explicit assumptions of Gaussian distributions, 
this derivation has reproduced the same estimator as the Bayesian posterior 
mode, f MP - 

The advantage of a Bayesian approach is that we can criticize these as- 
sumptions and modify them in order to make better reconstructions. 

Other image models 

The better matched our model of images P(f | TL) is to the real world, the bet- 
ter our image reconstructions will be, and the less data we will need to answer 
any given question. The Gaussian models which lead to the optimal linear 
filter are spectacularly poorly matched to the real world. For example, the 
Gaussian prior (46.3) fails to specify that all pixel intensities in an image are 
positive. This omission leads to the most pronounced artefacts where the im- 
age under observation has high contrast or large black patches. Optimal linear 
filters applied to astronomical data give reconstructions with negative areas in 
them, corresponding to patches of sky that suck energy out of telescopes! The 
maximum entropy model for image deconvolution (Gull and Daniell, 1978) 
was a great success principally because this model forced the reconstructed 
image to be positive. The spurious negative areas and complementary spu- 
rious positive areas are eliminated, and the quality of the reconstruction is 
greatly enhanced. 

The 'classic maximum entropy' model assigns an entropic prior 

P(f | a, m, Classic) = exp(aS(f , m))/Z, (46.13) 

where 

5(f,m) = ^(/iln(m,//,) + /, - m») (46.14) 

i 

(Skilling, 1989). This model enforces positivity; the parameter a defines a 
characteristic dynamic range by which the pixel values are expected to differ 
from the default image m. 

The 'intrinsic-correlation-function maximum-entropy' model (Gull, 1989) 
introduces an expectation of spatial correlations into the prior on f by writing 
f = Gh, where G is a convolution with an intrinsic correlation function, and 
putting a classic maxent prior on the underlying hidden image h. 

Probabilistic movies 

Having found not only the most probable image f MP but also error bars on 
it, £f|d, one task is to visualize those error bars. Whether or not we use 
Monte Carlo methods to infer f , a correlated random walk around the posterior 
distribution can be used to visualize the uncertainties and correlations. For 
a Gaussian posterior distribution, we can create a correlated sequence of unit 
normal random vectors n using 



n (t+i) = cn (t) + SZ; 



(46.15) 
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where z is a unit normal random vector and c 2 + s 2 = 1 (c controls how 
persistent the memory of the sequence is). We then render the image sequence 
defined by 

f W = f MP + EjgnW (46.16) 

1/2 

where S f | d is the Cholesky decomposition of Ef^. 

► 46.2 Supervised neural networks for image deconvolution 

Neural network researchers often exploit the following strategy Given a prob- 
lem currently solved with a standard algorithm: interpret the computations 
performed by the algorithm as a parameterized mapping from an input to an 
output, and call this mapping a neural network; then adapt the parameters 
to data so as to produce another mapping that solves the task better. By 
construction, the neural network can reproduce the standard algorithm, so 
this data-driven adaptation can only make the performance better. 

There are several reasons why standard algorithms can be bettered in this 
way. 

1. Algorithms are often not designed to optimize the real objective func- 
tion. For example, in speech recognition, a hidden Markov model is 
designed to model the speech signal, and is fitted so as to to maximize 
the generative probability given the known string of words in the training 
data; but the real objective is to discriminate between different words. 
If an inadequate model is being used, the neural-net-style training of 
the model will focus the limited resources of the model on the aspects 
relevant to the discrimination task. Discriminative training of hidden 
Markov models for speech recognition does improve their performance. 

2. The neural network can be more flexible than the standard model; some 
of the adaptive parameters might have been viewed as fixed features by 
the original designers. A flexible network can find properties in the data 
that were not included in the original model. 



► 46.3 Deconvolution in humans 

A huge fraction of our brain is devoted to vision. One of the neglected features 
of our visual system is that the raw image falling on the retina is severely 
blurred: while most people can see with a resolution of about 1 arcminute 
(one sixtieth of a degree) under any daylight conditions, bright or dim, the 
image on our retina is blurred through a point spread function of width as 
large as 5 arcminutes (Wald and Griffin, 1947; Howarth and Bradley, 1986). 
It is amazing that we are able to resolve pixels that are twenty-five times 
smaller in area than the blob produced on our retina by any point source. 

Isaac Newton was aware of this conundrum. It's hard to make a lens that 
does not have chromatic aberration, and our cornea and lens, like a lens made 
of ordinary glass, refract blue light more strongly than red. Typically our eyes 
focus correctly for the middle of the visible spectrum (green), so if we look 
at a single white dot made of red, green, and blue light, the image on our 
retina consists of a sharply focussed green dot surrounded by a broader red 
blob superposed on an even broader blue blob. The width of the red and blue 
blobs is proportional to the diameter of the pupil, which is largest under dim 
lighting conditions. [The blobs are roughly concentric, though most people 
have a slight bias, such that in one eye the red blob is centred a tiny distance 
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to the left and the blue is centred a tiny distance to the right, and in the other 
eye it's the other way round. This slight bias explains why when we look 
at blue and red writing on a dark background most people perceive the blue 
writing to be at a slightly greater depth than the red. In a minority of people, 
this small bias is the other way round and the red/blue depth perception is 
reversed. But this effect (which many people are aware of, having noticed it 
in cinemas, for example) is tiny compared with the chromatic aberration we 
are discussing.] 

You can vividly demonstrate to yourself how enormous the chromatic aber- 
ration in your eye is with the help of a sheet of card and a colour computer 
screen. 

For the most impressive results - I guarantee you will be amazed - use 
a dim room with no light apart from the computer screen; a pretty strong 
effect will still be seen even if the room has daylight coming into it, as long as 
it is not bright sunshine. Cut a slit about 1.5 mm wide in the card. On the 
screen, display a few small coloured objects on a black background. I especially 
recommend thin vertical objects coloured pure red, pure blue, magenta (i.e., 
red plus blue), and white (red plus blue plus green). 1 Include a little black- 
and-white text on the screen too. Stand or sit sufficiently far away that you 
can only just read the text - perhaps a distance of four metres or so, if you have 
normal vision. Now, hold the slit vertically in front of one of your eyes, and 
close the other eye. Hold the slit near to your eye - brushing your eyelashes - 
and look through it. Waggle the slit slowly to the left and to the right, so that 
the slit is alternately in front of the left and right sides of your pupil. What 
do you see? I see the red objects waggling to and fro, and the blue objects 
waggling to and fro, through huge distances and in opposite directions, while 
white objects appear to stay still and are negligibly distorted. Thin magenta 
objects can be seen splitting into their constituent red and blue parts. Measure 
how large the motion of the red and blue objects is - it's more than 5 minutes 
of arc for me, in a dim room. Then check how sharply you can see under these 
conditions - look at the text on the screen, for example: is it not the case that 
you can see (through your whole pupil) features far smaller than the distance 
through which the red and blue components were waggling? Yet when you are 
using the whole pupil, what is falling on your retina must be an image blurred 
with a blurring diameter equal to the waggling amplitude. 

One of the main functions of early visual processing must be to deconvolve 
this chromatic aberration. Neuroscientists sometimes conjecture that the rea- 
son why retinal ganglion cells and cells in the lateral geniculate nucleus (the 
main brain area to which retinal ganglion cells project) have centre-surround 
receptive fields with colour opponency (long wavelength in the centre and 
medium wavelength in the surround, for example) is in order to perform 'fea- 
ture extraction' or 'edge detection', but I think this view is mistaken. The 
reason we have centre-surround filters at the first stage of visual processing 
(in the fovea at least) is for the huge task of deconvolution of chromatic aber- 
ration. 

I speculate that the McCollough effect, an extremely long-lasting associ- 
ation of colours with orientation (McCollough, 1965; MacKay and MacKay, 
1974), is produced by the adaptation mechanism that tunes our chromatic- 
aberration-deconvolution circuits. Our deconvolution circuits must be rapidly 
tuneable, because the point spread function of our eye changes with our pupil 
diameter, which can change within seconds; and indeed the McCollough effect 
can be induced within 30 seconds. At the same time, the effect is long-lasting 
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when an eye is covered, because it's in our interests that our deconvolution 
circuits should stay well-tuned while we sleep, so that we can see sharply the 
instant we wake up. 

I also wonder whether the main reason that we evolved colour vision was 
not 'in order to see fruit better' but 'so as to be able to see black and white 
sharper' - deconvolving chromatic aberration is easier, even in an entirely black 
and white world, if one has access to chromatic information in the image. 

And a final speculation: why do our eyes make micro-saccades when we 
look at things? These miniature eye-movements are of an angular size big- 
ger than the spacing between the cones in the fovea (which are spaced at 
roughly 1 minute of arc, the perceived resolution of the eye). The typical 
size of a microsaccade is 5-10 minutes of arc (Ratliff and Riggs, 1950). Is it a 
coincidence that this is the same as the size of chromatic aberration? Surely 
micro-saccades must play an essential role in the deconvolution mechanism 
that delivers our high-resolution vision. 

► 46.4 Exercises 

Exercise 46. 1.^ Blur an image with a circular (top hat) point spread function 
and add noise. Then deconvolve the blurry noisy image using the optimal 
linear filter. Find error bars and visualize them by making a probabilistic 
movie. 
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Sparse Graph Codes 
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About Part VI 



The central problem of communication theory is to construct an encoding 
and a decoding system that make it possible to communicate reliably over 
a noisy channel. During the 1990s, remarkable progress was made towards 
the Shannon limit, using codes that are defined in terms of sparse random 
graphs, and which are decoded by a simple probability-based message-passing 
algorithm. 

In a sparse-graph code, the nodes in the graph represent the transmitted 
bits and the constraints they satisfy. For a linear code with a codeword length 
N and rate R = K/N, the number of constraints is of order M = N — K. 
Any linear code can be described by a graph, but what makes a sparse-graph 
code special is that each constraint only involves a small number of variables 
in the graph: so the number of edges in the graph scales roughly linearly with 
7V, rather than quadratically. 

In the following four chapters we will look at four families of sparse-graph 
codes: three families that are excellent for error-correction: low-density parity- 
check codes, turbo codes, and repeat-accumulate codes; and the family of 
digital fountain codes, which are outstanding for erasure-correction. 

All these codes can be decoded by a local message-passing algorithm on the 
graph, the sum-product algorithm, and, while this algorithm is not a perfect 
maximum likelihood decoder, the empirical results are record-breaking. 
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47 



Low-Density Parity-Check Codes 



A low-density parity-check code (or Gallager code) is a block code that has a 
parity-check matrix, H, every row and column of which is 'sparse'. 

A regular Gallager code is a low-density parity-check code in which every 
column of H has the same weight j and every row has the same weight k; reg- 
ular Gallager codes are constructed at random subject to these constraints. A 
low-density parity-check code with j = 3 and k = 4 is illustrated in figure 47.1. 
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► 47.1 Theoretical properties 

Low-density parity-check codes lend themselves to theoretical study. The fol- 
lowing results are proved in Gallager (1963) and MacKay (1999b). 

Low-density parity-check codes, in spite of their simple construction, are 
good codes, given an optimal decoder (good codes in the sense of section 11.4). 
Furthermore, they have good distance (in the sense of section 13.2). These two 
results hold for any column weight j > 3. Furthermore, there are sequences of 
low-density parity-check codes in which j increases gradually with 7V, in such 
a way that the ratio j/N still goes to zero, that are very good and that have 
very good distance. 

However, we don't have an optimal decoder, and decoding low-density 
parity-check codes is an NP-complete problem. So what can we do in practice? 



Figure 47.1. A low-density 
parity-check matrix and the 
corresponding graph of a rate- Vi 
low-density parity-check code 
with blocklength N = 16, and 
M = 12 constraints. Each white 
circle represents a transmitted bit. 
Each bit participates in j ' = 3 
constraints, represented by 
squares. Each constraint forces 
the sum of the k = 4 bits to which 
it is connected to be even. 



► 47.2 Practical decoding 

Given a channel output r, we wish to find the codeword t whose likelihood 
P(r 1 1) is biggest. All the effective decoding strategies for low-density parity- 
check codes are message-passing algorithms. The best algorithm known is 
the sum-product algorithm, also known as iterative probabilistic decoding or 
belief propagation. 

We'll assume that the channel is a memoryless channel (though more com- 
plex channels can easily be handled by running the sum-product algorithm 
on a more complex graph that represents the expected correlations among the 
errors (Worthen and Stark, 1998)). For any memoryless channel, there are 
two approaches to the decoding problem both of which lead to the generic 
problem 'find the x that maximizes 

P*(x) =P(x)l[Hx = z]', (47.1) 

where P(x) is a separable distribution on a binary vector x, and z is another 
binary vector. Each of these two approaches represents the decoding problem 
in terms of a factor graph (Chapter 26). 
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a a a a a" a a a a a a a 




(a) The prior distribution over codewords 

P(t) oc l[Ht = 0]. 

The variable nodes are the transmitted bits {t n }. 

Each node represents the factor lE n6 jv( m ) i n = 0mod2]. 

P(r n I t n ) ( b ) The posterior distribution over codewords, 

in 

P(t|r) oc P(t)P(r|t). 
Each upper function node represents a likelihood factor P(r n | t n ) 



P(n„ 



(c) The joint probability of the noise n and syndrome z, 

P(n,z) = P(n)l[z = Hn]. 

The top variable nodes are now the noise bits {n n }. 

The added variable nodes at the base are the syndrome values 

{z m }- 

Each definition z m = Y2 n Hmnn n mod 2 is enforced by a S factor. 



The codeword decoding viewpoint 

First, we note that the prior distribution over codewords, 

P(t) oc l[Ht = 0mod2], 



Figure 47.2. Factor graphs 
associated with a low-density 
parity-check code. 



(47.2) 



can be represented by a factor graph (figure 47.2a), with the factorization 
being 

P(t) oc Y[t[ ^t„ = 0mod2]. (47.3) 

(We'll omit the 'mod 2's from now on.) The posterior distribution over code- 
words is given by multiplying this prior by the likelihood, which introduces 
another TV factors, one for each received bit. 



P(t|r) oc P(t)P(r|t) 

I ^n) 



(47.4) 



m neH(m) 



The factor graph corresponding to this function is shown in figure 47.2b. It 
is the same as the graph for the prior, except for the addition of likelihood 
'dongles' to the transmitted bits. 

In this viewpoint, the received signal r n can live in any alphabet; all that 
matters are the values of P(r n \t n ). 



The syndrome decoding viewpoint 

Alternatively, we can view the channel output in terms of a binary received 
vector r and a noise vector n, with a probability distribution P(n) that can 
be derived from the channel properties and whatever additional information 
is available at the channel outputs. 

For example, with a binary symmetric channel, we define the noise by 
r = t + n, the syndrome z = Hr, and noise model P(n n ) = f. For other 
channels such as the Gaussian channel with output y, we may define a received 
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47.3: Decoding with the sum-product algorithm 

binary vector r however we wish and obtain an effective binary noise model 
P(n) from y (exercises 9.18 (p. 155) and 25.1 (p. 325)). 

The joint probability of the noise n and syndrome z = Hn can be factored 

as 

P(n,z) = P(n)l[z = Hn] 

= l[P(nn) ]Jt[z m = 5>n]- (47.5) 

n m neAf(m) 

The factor graph of this function is shown in figure 47.2c. The variables n 
and z can also be drawn in a 'belief network' (also known as a 'Bayesian 
network', 'causal network', or 'influence diagram') similar to figure 47.2a, but 
with arrows on the edges from the upper circular nodes (which represent the 
variables n) to the lower square nodes (which now represent the variables z). 
We can say that every bit x n is the parent of j checks z m , and each check z m 
is the child of k bits. 

Both decoding viewpoints involve essentially the same graph. Either ver- 
sion of the decoding problem can be expressed as the generic decoding problem 
'find the x that maximizes 

P*(x) = P(x)l[Hx = z]'; (47.6) 

in the codeword decoding viewpoint, x is the codeword t, and z is 0; in the 
syndrome decoding viewpoint, x is the noise n, and z is the syndrome. 

It doesn't matter which viewpoint we take when we apply the sum-product 
algorithm. The two decoding algorithms are isomorphic and will give equiva- 
lent outcomes (unless numerical errors intervene). 

I tend to use the syndrome decoding viewpoint because it has one advantage: 
one does not need to implement an encoder for a code in order to be able to 
simulate a decoding problem realistically. 

We'll now talk in terms of the generic decoding problem. 

► 47.3 Decoding with the sum-product algorithm 

We aim, given the observed checks, to compute the marginal posterior proba- 
bilities P(x n = 1 | z, H) for each n. It is hard to compute these exactly because 
the graph contains many cycles. However, it is interesting to implement the 
decoding algorithm that would be appropriate if there were no cycles, on the 
assumption that the errors introduced might be relatively small. This ap- 
proach of ignoring cycles has been used in the artificial intelligence literature 
but is now frowned upon because it produces inaccurate probabilities. How- 
ever, if we are decoding a good error-correcting code, we don't care about 
accurate marginal probabilities - we just want the correct codeword. Also, 
the posterior probability, in the case of a good code communicating at an 
achievable rate, is expected typically to be hugely concentrated on the most 
probable decoding; so we are dealing with a distinctive probability distribution 
to which experience gained in other fields may not apply 

The sum-product algorithm was presented in Chapter 26. We now write 
out explicitly how it works for solving the decoding problem 

Hx = z (mod 2). 

For brevity, we reabsorb the dongles hanging off the x and z nodes in fig- 
ure 47.2c and modify the sum-product algorithm accordingly. The graph in 
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which x and z live is then the original graph (figure 47.2a) whose edges are 
defined by the Is in H. The graph contains nodes of two types, which we'll 
call checks and bits. The graph connecting the checks and bits is a bipartite 
graph: bits connect only to checks, and vice versa. On each iteration, a prob- 
ability ratio is propagated along each edge in the graph, and each bit node x n 
updates its probability that it should be in state 1. 

We denote the set of bits n that participate in check m by Af{m) = {n : 
H m n = l}- Similarly we define the set of checks in which bit n participates, 
A4(n) = {m : H mn =l}. We denote a set Af(m) with bit n excluded by 
Af(m)\n. The algorithm has two alternating parts, in which quantities q mn 
and r mn associated with each edge in the graph are iteratively updated. The 
quantity q^ nn is meant to be the probability that bit n of x has the value x, 
given the information obtained via checks other than check m. The quantity 
r mn is meant to be the probability of check m being satisfied if bit n of x is 
considered fixed at x and the other bits have a separable distribution given 
by the probabilities {q mn ' ■ n' G N(m)\n}. The algorithm would produce the 
exact posterior probabilities of all the bits after a fixed number of iterations 
if the bipartite graph defined by the matrix H contained no cycles. 



Initialization. Let p Q n = P(x n = 0) (the prior probability that bit x n is 0), 
and let p\ = P(x n = l) = 1 - p°. If we are taking the syndrome decoding 
viewpoint and the channel is a binary symmetric channel then p\ will equal 
/. If the noise level varies in a known way (for example if the channel is a 
binary input Gaussian channel with a real output) then p\ is initialized to the 
appropriate normalized likelihood. For every (n, in) such that H mn = 1 the 
variables q mn and g^ n are initialized to the values p^ and p n respectively. 



Horizontal step. In the horizontal step of the algorithm (horizontal from 
the point of view of the matrix H), we run through the checks m and compute 
for each n G JV(m) two probabilities: first, r mn , the probability of the observed 
value of z m arising when x n = 0, given that the other bits {x n i : n' ^ n} have 
a separable distribution given by the probabilities {q mn i, g^,,,}, defined by: 

r° mn = E P ( Z ™ I X n = °> K' : e ^(m)\n}) n q^, 

{x n r.n'eN(m)\n} n'eH(m)\n 

(47.7) 

and second, r^ n , the probability of the observed value of z m arising when 
x n = 1, defined by: 

r 1 m „= Yl P{z m \x n = l, {x n , : n' G N{m)\n}) \{ 

{x n /:n'eAf(m)\n} n'eH(m)\n 

(47.8) 

The conditional probabilities in these summations are either zero or one, de- 
pending on whether the observed z m matches the hypothesized values for x n 
and the {x n i}. 

These probabilities can be computed in various obvious ways based on 
equation (47.7) and (47.8). The computations may be done most efficiently (if 
\N(m) | is large) by regarding z m + x n as the final state of a Markov chain with 
states and 1, this chain being started in state 0, and undergoing transitions 
corresponding to additions of the various x n i, with transition probabilities 
given by the corresponding q® mn , and q mn i- The probabilities for z m having its 
observed value given either x n = or x n = 1 can then be found efficiently by 
use of the forward-backward algorithm (section 25.3). 
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47.3: Decoding with the sum-product algorithm 

A particularly convenient implementation of this method uses forward and 
backward passes in which products of the differences 5q mn = q^ nn — q\ nn are 
computed. We obtain Sr mn = — r\ nn from the identity: 

5r mn = (-l) z ™ n 6q mn> . (47.9) 

n'eH(m)\n 

This identity is derived by iterating the following observation: if £ = x^ + 

2Vmod2, and x^ and x u have probabilities q^q% and q^, q\ of being and 1, 

then P(C = 1) = q\ql + <Z°?J and P(C = 0) = q°q° v + q^l Thus P(C = 0) - 

P(C=l) = (q°,-q 1 M-qD- 
We recover r% ln and r} nn using 

r° mn = V2(l + 6r mn ), r l mn = V2{l-5r mn ). (47.10) 

The transformations into differences Sq and back from Sr to {r} may be viewed 
as a Fourier transform and an inverse Fourier transformation. 

Vertical step. The vertical step takes the computed values of r^ nn and r^ nn 
and updates the values of the probabilities q^n an d Qmn- For each n we 
compute: 

q°mn = ^mnVl \{ r° m , n (47.11) 
m'eM(n)\m 

<Ln = a mnP i Y[ ^m'n (47.12) 
m'eM(n)\m 

where a mn is chosen such that qmn+lmn = 1- These products can be efficiently 
computed in a downward pass and an upward pass. 

We can also compute the 'pseudoposterior probabilities' q^ and q^ at this 
iteration, given by: 

Qn = 0!nP° n Yi r ™«' (47.13) 
meM(n) 

In = a nPl II r ™»' ( 47J4 ) 
m£M(n) 

These quantities are used to create a tentative decoding x, the consistency 
of which is used to decide whether the decoding algorithm can halt. (Halt if 
Hx = z.) 

At this point, the algorithm repeats from the horizontal step. 

The stop-when-it's-done decoding method. The recommended decod- 
ing procedure is to set x n to 1 if q\ > 0.5 and see if the checks Hx = z mod 2 are 
all satisfied, halting when they are, and declaring a failure if some maximum 
number of iterations (e.g. 200 or 1000) occurs without successful decoding. In 
the event of a failure, we may still report x, but we flag the whole block as a 
failure. 

We note in passing the difference between this decoding procedure and 
the widespread practice in the turbo code community, where the decoding 
algorithm is run for a fixed number of iterations (irrespective of whether the 
decoder finds a consistent state at some earlier time). This practice is wasteful 
of computer time, and it blurs the distinction between undetected and detected 
errors. In our procedure, 'undetected' errors occur if the decoder finds an x 
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Figure 47.3. Demonstration of encoding with a rate-1/2 Gallagcr code. The encoder is derived from 
a very sparse 10 000 x 20 000 parity-check matrix with three Is per column (figure 47.4). 
(a) The code creates transmitted vectors consisting of 10 000 source bits and 10 000 parity- 
check bits, (b) Here, the source sequence has been altered by changing the first bit. Notice 
that many of the parity-check bits arc changed. Each parity bit depends on about half of 
the source bits, (c) The transmission for the case s = (1, 0, 0, .... 0). This vector is the 
difference (modulo 2) between transmissions (a) and (b). [Dilbert image Copyright©1997 
United Feature Syndicate, Inc., used with permission.] 



satisfying Hie = z mod 2 which is not equal to the true x. 'Detected' errors 
occur if the algorithm runs for the maximum number of iterations without 
finding a valid decoding. Undetected errors are of scientific interest because 
they reveal distance properties of a code. And in engineering practice, it would 
seem preferable for the blocks that are known to contain detected errors to be 
so labelled if practically possible. 



Cost. In a brute force approach, the time to create the generator matrix 
scales as iV 3 , where N is the block size. The encoding time scales as iV 2 , but 
encoding involves only binary arithmetic, so for the block lengths studied here 
it takes considerably less time than the simulation of the Gaussian channel. 
Decoding involves approximately 6Nj floating-point multiplies per iteration, 
so the total number of operations per decoded bit (assuming 20 iterations) 
is about 120t/R, independent of blocklength. For the codes presented in the 
next section, this is about 800 operations. 

The encoding complexity can be reduced by clever encoding tricks invented 
by Richardson and Urbanke (2001b) or by specially constructing the parity- 
check matrix (MacKay et at, 1999). 

The decoding complexity can be reduced, with only a small loss in perfor- 
mance, by passing low-precision messages in place of real numbers (Richardson 
and Urbanke, 2001a). 



► 47.4 Pictorial demonstration of Gallager codes 

Figures 47.3-47.7 illustrate visually the conditions under which low-density 
parity-check codes can give reliable communication over binary symmetric 
channels and Gaussian channels. These demonstrations may be viewed as 
animations on the world wide web. 1 



1 http : //www . inf erence .phy . cam. ac .uk/mackay/ codes/gif s/ 
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H = 



Figure 47.4. A low-density parity-check matrix with N = 20 000 columns of weight j = 3 and M = 
10 000 rows of weight k = 6. 



Encoding 

Figure 47.3 illustrates the encoding operation for the case of a Gallager code 
whose parity-check matrix is a 10 000 x 20 000 matrix with three Is per col- 
umn (figure 47.4). The high density of the generator matrix is illustrated in 
figure 47.3b and c by showing the change in the transmitted vector when one 
of the 10 000 source bits is altered. Of course, the source images shown here 
are highly redundant, and such images should really be compressed before 
encoding. Redundant images are chosen in these demonstrations to make it 
easier to see the correction process during the iterative decoding. The decod- 
ing algorithm does not take advantage of the redundancy of the source vector, 
and it would work in exactly the same way irrespective of the choice of source 
vector. 



Iterative decoding 

The transmission is sent over a channel with noise level / = 7.5% and the 
received vector is shown in the upper left of figure 47.5. The subsequent 
pictures in figure 47.5 show the iterative probabilistic decoding process. The 
sequence of figures shows the best guess, bit by bit, given by the iterative 
decoder, after 0, 1, 2, 3, 10, 11, 12, 13 iterations. The decoder halts after 
the 13th iteration when the best guess violates no parity checks. This final 
decoding is error free. 

In the case of an unusually noisy transmission, the decoding algorithm fails 
to find a valid decoding. For this code and a channel with / = 7.5%, such 
failures happen about once in every 100 000 transmissions. Figure 47.6 shows 
this error rate compared with the block error rates of classical error-correcting 
codes. 
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RECEIVED: 




REQUNDAt^ 





REDUNDAt^ 
GLASS, f 





REDUNDAt^ 
GLASS, f 




DECODED: 




REDUNDAt^ 
GLASS, f 




iREDUNDAt^ 
iGLASS/ 




Figure 47.5. Iterative probabilistic decoding of a low-density parity-check code for a transmission 
received over a channel with noise level / = 7.5%. The sequence of figures shows the best 
guess, bit by bit, given by the iterative decoder, after 0, 1, 2, 3, 10, 11, 12, 13 iterations. 
The decoder halts after the 13th iteration when the best guess violates no parity checks. 
This final decoding is error free. 




low-density 
parity-check code 

Shannon limit 



Figure 47.6. Error probability of 
the low-density parity-check code 
(with error bars) for binary 
symmetric channel with / = 7.5%, 
compared with algebraic codes. 
Squares: repetition codes and 
Hamming (7, 4) code; other 
points: Reed-Muller and BCH 
codes. 



0.2 0.4 0.6 0.8 1 
Rate 
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Figure 47.7. Demonstration of a 
Gallager code for a Gaussian 
channel, (al) The received vector 
after transmission over a Gaussian 
channel with x/a = 1.185 
(E b /N = 1.47 dB). The greyscale 
represents the value of the 
normalized likelihood. This 
transmission can be perfectly 
decoded by the sum-product 
decoder. The empirical 
probability of decoding failure is 
about 1CT 5 . (a2) The probability 
distribution of the output y of the 
channel with x/a = 1.185 for each 
of the two possible inputs, (bl) 
The received transmission over a 
Gaussian channel with x/a = 1.0, 
which corresponds to the Shannon 
limit. (b2) The probability 
distribution of the output y of the 
channel with x/a = 1.0 for each of 
the two possible inputs. 






0.1 




0.01 




0.001 


N=96 




(N=96) 


0.0001 


N=204 






1e-05 
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Gaussian channel 



(b) 



In figure 47.7 the left picture shows the received vector after transmission over 
a Gaussian channel with x/a = 1.185. The greyscale represents the value 
of the normalized likelihood, „. I , P ^A~/A — ttt- This signal-to- noise ratio 
x/a = 1.185 is a noise level at which this rate-1/2 Gallager code communicates 
reliably (the probability of error is ~ 10~ 5 ). To show how close we are to the 
Shannon limit, the right panel shows the received vector when the signal-to- 
noise ratio is reduced to x/a = 1.0, which corresponds to the Shannon limit 
for codes of rate 1/2. 



Figure 47.8. Performance of 
rate-V2 Gallager codes on the 
Gaussian channel. Vertical axis: 
block error probability. Horizontal 
axis: signal-to-noise ratio E(,/Nq. 
(a) Dependence on blocklcngth N 
for (j, k) = (3, 6) codes. From left 
to right: N = 816, N = 408, 
N = 204, N = 96. The dashed 
lines show the frequency of 
undetected errors, which is 
measurable only when the 
blocklcngth is as small as iV = 96 
or N = 204. (b) Dependence on 
column weight j for codes of 
blocklength N = 816. 



Variation of performance with code parameters 

Figure 47.8 shows how the parameters N and j affect the performance of 
low-density parity-check codes. As Shannon would predict, increasing the 
blocklength leads to improved performance. The dependence on j follows a 
different pattern. Given an optimal decoder, the best performance would be 
obtained for the codes closest to random codes, that is, the codes with largest 
j. However, the sum-product decoder makes poor progress in dense graphs, 
so the best performance is obtained for a small value of j. Among the values 
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Figure 47.9. Schematic illustration 
of constructions (a) of a 
completely regular Gallager code 
with j = 3, k = 6 and R= 1/2; 
(b) of a nearly-regular Gallager 
code with rate 1/3. Notation: an 
integer represents a number of 
permutation matrices superposed 
on the surrounding square. A 
diagonal line represents an 
identity matrix. 



Figure 47.10. Monte Carlo simulation of density evolution, following the decoding process for j = A, k = 
8. Each curve shows the average entropy of a bit as a function of number of iterations, 
as estimated by a Monte Carlo algorithm using 10 000 samples per iteration. The noise 
level of the binary symmetric channel / increases by steps of 0.005 from bottom graph 
(/ = 0.010) to top graph (/ = 0.100). There is evidently a threshold at about / = 0.075, 
above which the algorithm cannot determine x. From MacKay (1999b). 



of j shown in the figure, j = 3 is the best, for a blocklength of 816, down to a 
block error probability of 10~ 5 . 

This observation motivates construction of Gallager codes with some col- 
umns of weight 2. A construction with M/2 columns of weight 2 is shown in 
figure 47.9b. Too many columns of weight 2, and the code becomes a much 
poorer code. 

As we'll discuss later, we can do even better by making the code even more 
irregular. 

► 47.5 Density evolution 

One way to study the decoding algorithm is to imagine it running on an infinite 
tree-like graph with the same local topology as the Gallager code's graph. 
The larger the matrix H, the closer its decoding properties should approach 
those of the infinite graph. 

Imagine an infinite belief network with no loops, in which every bit x n 
connects to j checks and every check z m connects to k bits (figure 47.11). 
We consider the iterative flow of information in this network, and examine 
the average entropy of one bit as a function of number of iterations. At each 
iteration, a bit has accumulated information from its local network out to a 
radius equal to the number of iterations. Successful decoding will occur only 
if the average entropy of a bit decreases to zero as the number of iterations 
increases. 

The iterations of an infinite belief network can be simulated by Monte 
Carlo methods - a technique first used by Gallager (1963). Imagine a network 
of radius / (the total number of iterations) centred on one bit. Our aim is 
to compute the conditional entropy of the central bit x given the state z of 
all checks out to radius /. To evaluate the probability that the central bit 
is 1 given a particular syndrome z involves an /-step propagation from the 
outside of the network into the centre. At the ith iteration, probabilities r at 




Figure 47.11. Local topology of 
the graph of a Gallager code with 
column weight j = 3 and row 
weight k = 4. White nodes 
represent bits, xi; black nodes 
represent checks, z m \ each edge 
corresponds to a 1 in H. 
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radius I — i + 1 are transformed into qs and then into rs at radius / — i in 
a way that depends on the states x of the unknown bits at radius I — i. In 
the Monte Carlo method, rather than simulating this network exactly, which 
would take a time that grows exponentially with /, we create for each iteration 
a representative sample (of size 100, say) of the values of {r, x). In the case 
of a regular network with parameters j, k, each new pair {r, x} in the list at 
the ith iteration is created by drawing the new x from its distribution and 
drawing at random with replacement (j — l)(fc — 1) pairs {r, x} from the list at 
the (i — l)th iteration; these are assembled into a tree fragment (figure 47 '.12) 
and the sum-product algorithm is run from top to bottom to find the new r 
value associated with the new node. 

As an example, the results of runs with j = 4, k = 8 and noise densities / 
between 0.01 and 0.10, using 10000 samples at each iteration, are shown in 
figure 47.10. Runs with low enough noise level show a collapse to zero entropy 
after a small number of iterations, and those with high noise level decrease to 
a non-zero entropy corresponding to a failure to decode. 

The boundary between these two behaviours is called the threshold of the 
decoding algorithm for the binary symmetric channel. Figure 47.10 shows by 
Monte Carlo simulation that the threshold for regular (j, k) = (4, 8) codes 
is about 0.075. Richardson and Urbanke (2001a) have derived thresholds for 
regular codes by a tour de force of direct analytic methods. Some of these 
thresholds are shown in table 47.13. 

Approximate density evolution 

For practical purposes, the computational cost of density evolution can be 
reduced by making Gaussian approximations to the probability distributions 
over the messages in density evolution, and updating only the parameters of 
these approximations. For further information about these techniques, which 
produce diagrams known as EXIT charts, see (ten Brink, 1999; Chung et al, 
2001; ten Brink et al, 2002). 

► 47.6 Improving Gallager codes 

Since the rediscovery of Gallager codes, two methods have been found for 
enhancing their performance. 

Clump bits and checks together 

First, we can make Gallager codes in which the variable nodes are grouped 
together into metavariables consisting of say 3 binary variables, and the check 
nodes are similarly grouped together into metachecks. As before, a sparse 
graph can be constructed connecting metavariables to metachecks, with a lot 
of freedom about the details of how the variables and checks within are wired 
up. One way to set the wiring is to work in a finite field GF(q) such as GF(4) 
or GF(8), define low-density parity-check matrices using elements of GF{q), 
and translate our binary messages into GF(q) using a mapping such as the 
one for GF(4) given in table 47.14. Now, when messages are passed during 
decoding, those messages are probabilities and likelihoods over conjunctions 
of binary variables. For example if each clump contains three binary variables 
then the likelihoods will describe the likelihoods of the eight alternative states 
of those bits. 

With carefully optimized constructions, the resulting codes over GF(4), 




Figure 47.12. A tree-fragment 
constructed during Monte Carlo 
simulation of density evolution. 
This fragment is appropriate for a 
regular j = 3, k = 4 Gallager code. 





/max 


(3,6) 


0.084 


(4,8) 


0.076 


(5,10) 


0.068 



Table 47.13. Thresholds / max for 
regular low-density parity-check 
codes, assuming sum-product 
decoding algorithm, from 
Richardson and Urbanke (2001a). 
The Shannon limit for rate-V2 
codes is / max = 0.11. 



GF{4) <-» binary 




1 

A 
B 



00 
01 
10 
11 



Table 47.14. Translation between 
GF(4) and binary for message 
symbols. 



GF(4) binary 




1 
A 
B 



00 
00 

10 
01 

11 
10 

01 
11 



Table 47.15. Translation between 
GF(4) and binary for matrix 
entries. An M x N parity-check 
matrix over GF(4) can be turned 
into a 2M x 27V binary 
parity-check matrix in this way. 
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Algorithm 47.16. The Fourier 
transform over GF(4). 
The Fourier transform F of a 
function / over GF(2) is given by 
F° = f° + f 1 ,F 1 =f-f 1 . 
Transforms over GF(2 k ) can be 
viewed as a sequence of binary 
transforms in each of k 
dimensions. The inverse 
transform is identical to the 
Fourier transform, except that we 
also divide by 2 k . 



Figure 47.17. Comparison of regular binary Gallager codes with irregular codes, codes over GF(q), 
and other outstanding codes of rate Vi. From left (best performance) to right: Irregular 
low-density parity-check code over GF(8), blocklength 48 000 bits (Davcy, 1999); JPL 
turbo code (JPL, 1996) blocklength 65 536; Regular low-density parity-check over G_F(16), 
blocklength 24 448 bits (Davcy and MacKay, 1998); Irregular binary low-density parity- 
check code, blocklength 16 000 bits (Davey, 1999); Luby et al. (1998) irregular binary 
low-density parity-check code, blocklength 64 000 bits; JPL code for Galileo (in 1992, 
this was the best known code of rate 1/4); Regular binary low-density parity-check code: 
blocklength 40 000 bits (MacKay, 1999b). The Shannon limit is at about -0.79dB. As of 
2003, even better sparse-graph codes have been constructed. 



GF(8), and G-F(16) perform nearly one decibel better than comparable binary 
Gallager codes. 

The computational cost for decoding in GF(q) scales as glogg, if the ap- 
propriate Fourier transform is used in the check nodes: the update rule for 
the check-to-variable message, 



>X n ' — z n 



n 



(47.15) 



jeN(m)\n 



is a convolution of the quantities q^j, so the summation can be replaced by 
a product of the Fourier transforms of qr° ■ for j G Af(m)\n, followed by 
an inverse Fourier transform. The Fourier transform for GF(4) is shown in 
algorithm 47.16. 



Make the graph irregular 

The second way of improving Gallager codes, introduced by Luby et al. (2001b), 
is to make their graphs irregular. Instead of giving all variable nodes the same 
degree j, we can have some variable nodes with degree 2, some 3, some 4, and 
a few with degree 20. Check nodes can also be given unequal degrees - this 
helps improve performance on erasure channels, but it turns out that for the 
Gaussian channel, the best graphs have regular check degrees. 

Figure 47.17 illustrates the benefits offered by these two methods for im- 
proving Gallager codes, focussing on codes of rate V4. Making the binary code 
irregular gives a win of about 0.4 dB; switching from GF(2) to G-F(16) gives 
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about 0.6 dB; and Matthew Davey's code that combines both these features - 
it's irregular over GF(8) - gives a win of about 0.9 dB over the regular binary 
Gallager code. 

Methods for optimizing the profile of a Gallager code (that is, its number of 
rows and columns of each degree), have been developed by Richardson et al. 
(2001) and have led to low-density parity-check codes whose performance, 
when decoded by the sum-product algorithm, is within a hair's breadth of the 
Shannon limit. 



Figure 47.18. An algebraically 
constructed low-density 
parity-check code satisfying many 
redundant constraints 
outperforms an equivalent random 
Gallager code. The table shows 
the N, M, K, distance d, and row 
weight k of some difference-set 
cyclic codes, highlighting the 
codes that have large d/N, small 
k, and large N/M. In the 
comparison the Gallager code had 
(j, k) = (4, 13), and rate identical 
to the N = 273 difference-set 
cyclic code. 



Algebraic constructions of Gallager codes 

The performance of regular Gallager codes can be enhanced in a third man- 
ner: by designing the code to have redundant sparse constraints. There is a 
difference-set cyclic code, for example, that has N = 273 and K = 191, but 
the code satisfies not M = 82 but N, i.e., 273 low- weight constraints (figure 
47.18). It is impossible to make random Gallager codes that have anywhere 
near this much redundancy among their checks. The difference-set cyclic code 
performs about 0.7 dB better than an equivalent random Gallager code. 

An open problem is to discover codes sharing the remarkable properties of 
the difference-set cyclic codes but with different blocklengths and rates. I call 
this task the Tanner challenge. 



► 47.7 Fast encoding of low-density parity-check codes 

We now discuss methods for fast encoding of low-density parity-check codes - 
faster than the standard method, in which a generator matrix G is found by 
Gaussian elimination (at a cost of order M 3 ) and then each block is encoded 
by multiplying it by G (at a cost of order M 2 ). 



Staircase codes 

Certain low-density parity-check matrices with M columns of weight 2 or less 
can be encoded easily in linear time. For example, if the matrix has a staircase 
structure as illustrated by the right-hand side of 









i 


1 






: 










1 _ 


: 




l 






l 






1 1 






- 




l 












1 1 




1 " 1 














: 




1 1 










l 










1 1 






: 






1 1 
















l 




l 












l 


1 1 




- 
























- 












- 




1 1 




l 










i 




- 




1 1 












i 




• 




l 



























Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 



570 



47 — Low-Density Parity-Check Codes 



and if the data s are loaded into the first K bits, then the M parity bits p 
can be computed from left to right in linear time. 



Pi = 
P2 = 
P3 = 



Pi 
P2 



E»=i H\ n s n 
E n =i H 2n s n 
E«=i H 3n s n 



(47.17) 



PM 



Pm- 



■1+ J2n=l^MnS n - 



If we call two parts of the H matrix [H s |H p ], we can describe the encoding 
operation in two steps: first compute an intermediate parity vector v = H s s; 
then pass v through an accumulator to create p. 

The cost of this encoding method is linear if the sparsity of H is exploited 
when computing the sums in (47.17). 



Fast encoding of general low-density parity-check codes 

Richardson and Urbanke (2001b) demonstrated an elegant method by which 
the encoding cost of any low-density parity-check code can be reduced from 
the straightforward method's M 2 to a cost of N + g 2 , where g, the gap, is 
hopefully a small constant, and in the worst cases scales as a small fraction of 
N. 

■M 







^\ 


A 


B 








t\ 


C 


D 


E 


- M * 



M 



Figure 47.19. The parity-check 
matrix in approximate 
lower-triangular form. 



In the first step, the parity-check matrix is rearranged, by row-interchange 
and column-interchange, into the approximate lower-triangular form shown in 
figure 47.19. The original matrix H was very sparse, so the six matrices A, 
B, T, C, D, and E are also very sparse. The matrix T is lower triangular and 
has Is everywhere on the diagonal. 



H 



A B T 
C D E 



(47.18) 



The source vector s of length K = N — M is encoded into a transmission 
t = [s,pi,p 2 ] as follows. 



1. Compute the upper syndrome of the source vector, 

za = As. 
This can be done in linear time. 



(47.19) 



2. Find a setting of the second parity bits, p^, such that the upper syn- 
drome is zero. 



P2 



-T-'z A . 



(47.20) 

This vector can be found in linear time by back-substitution, i.e., com- 
puting the first bit of p 2 , then the second, then the third, and so forth. 
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47.8: Further reading 

3. Compute the lower syndrome of the vector [s,0,p^]: 

z B = Cs - 
This can be done in linear time. 

4. Now we get to the clever bit. Define the matrix 

F^-ET^B + D, (47.22) 

and find its inverse, F _1 . This computation needs to be done once only, 
and its cost is of order g 3 . This inverse F _1 is a dense gx g matrix. [If F 
is not invertible then either H is not of full rank, or else further column 
permutations of H can produce an F that is invertible.] 

Set the first parity bits, pi, to 

Pi = -F _1 z B 
This operation has a cost of order g 2 . 

Claim: At this point, we have found the correct setting of the first parity 
bits, pi. 

5. Discard the tentative parity bits p^ and find the new upper syndrome, 

z c = z A + B Pl . (47.24) 
This can be done in linear time. 

6. Find a setting of the second parity bits, P2, such that the upper syndrome 
is zero, 

P2 = -T^zc (47.25) 
This vector can be found in linear time by back-substitution. 

► 47.8 Further reading 

Low-density parity-check codes codes were first studied in 1962 by Gallager, 
then were generally forgotten by the coding theory community. Tanner (1981) 
generalized Gallager's work by introducing more general constraint nodes; the 
codes that are now called turbo product codes should in fact be called Tanner 
product codes, since Tanner proposed them, and his colleagues (Karplus and 
Krit, 1991) implemented them in hardware. Publications on Gallager codes 
contributing to their 1990s rebirth include (Wiberg et al, 1995; MacKay and 
Neal, 1995; MacKay and Neal, 1996; Wiberg, 1996; MacKay, 1999b; Spielman, 
1996; Sipser and Spielman, 1996). Low-precision decoding algorithms and fast 
encoding algorithms for Gallager codes are discussed in (Richardson and Ur- 
banke, 2001a; Richardson and Urbanke, 2001b). MacKay and Davey (2000) 
showed that low-density parity-check codes can outperform Reed-Solomon 
codes, even on the Reed-Solomon codes' home turf: high rate and short block- 
lengths. Other important papers include (Luby et al, 2001a; Luby et al, 
2001b; Luby et al, 1997; Davey and MacKay, 1998; Richardson et al, 2001; 
Chung et al, 2001). Useful tools for the design of irregular low-density parity- 
check codes include (Chung et al, 1999; Urbanke, 2001). 

See (Wiberg, 1996; Frey, 1998; McEliece et al, 1998) for further discussion 
of the sum-product algorithm. 

For a view of low-density parity-check code decoding in terms of group 
theory and coding theory, see (Forney, 2001; Offer and Soljanin, 2000; Offer 



(47.21) 



(47.23) 
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and Soljanin, 2001); and for background reading on this topic see (Hartmann 
and Rudolph, 1976; Terras, 1999). There is a growing literature on the prac- 
tical design of low-density parity-check codes (Mao and Banihashemi, 2000; 
Mao and Banihashemi, 2001; ten Brink et al., 2002); they are now being 
adopted for applications from hard drives to satellite communications. 

For low-density parity-check codes applicable to quantum error-correction, 
see MacKay et al. (2003). 

► 47.9 Exercises 

Exercise 47. 1.^ The 'hyperbolic tangent' version of the decoding algorithm. 

In section 47.3, the sum-product decoding algorithm for low-density 
parity-check codes was presented first in terms of quantities qmn and 
Tmn, then in terms of quantities 5q and Sr. There is a third description, 
in which the {q} are replaced by log probability-ratios, 

l mn ^\n^. (47.26) 

Qmn 

Show that 

Sq mn = Qmn ~ iLn = tanh(Z mn /2). (47.27) 
Derive the update rules for {r} and {/}. 

Exercise 47. 2 J 2 ' p - 572 l I am sometimes asked 'why not decode other linear 
codes, for example algebraic codes, by transforming their parity-check 
matrices so that they are low-density, and applying the sum-product 
algorithm?' [Recall that any linear combination of rows of H, H' = PH, 
is a valid parity-check matrix for a code, as long as the matrix P is 
invertible; so there are many parity check matrices for any one code.] 

Explain why a random linear code does not have a low-density parity- 
check matrix. [Here, low-density means 'having row- weight at most k\ 
where k is some small constant <C N] 

Exercise 47.3. ^ Show that if a low-density parity-check code has more than 
M columns of weight 2 - say uM columns, where a > 1 - then the code 
will have words with weight of order log M. 

Exercise 47.4.^ In section 13.5 we found the expected value of the weight 
enumerator function A(w), averaging over the ensemble of all random 
linear codes. This calculation can also be carried out for the ensemble of 
low-density parity-check codes (Gallager, 1963; MacKay, 1999b; Litsyn 
and Shevelev, 2002). It is plausible, however, that the mean value of 
A(w) is not always a good indicator of the typical value of A(w) in the 
ensemble. For example, if, at a particular value of w, 99% of codes have 
A(w) = 0, and 1% have A(w) = 100 000, then while we might say the 
typical value of A(w) is zero, the mean is found to be 1000. Find the 
typical weight enumerator function of low-density parity-check codes. 

► 47.10 Solutions 

Solution to exercise 47.2 (p. 572). Consider codes of rate R and blocklength 
7V, having K = RN source bits and M = {1 — R)N parity-check bits. Let all 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 

47.10: Solutions 

the codes have their bits ordered so that the first K bits are independent, so 
that we could if we wish put the code in systematic form, 

G= [1k|P t ]; H= [P|1 m ]. (47.28) 
The number of distinct linear codes is the number of matrices P, which is 

M = 2 MK = 2 n2r{1 - r \ Can these all be expressed as distinct low-density logM ~ N 2 R{1 - R) 
parity-check codes? 

The number of low-density parity-check matrices with row- weight k is 




(47.29) 



and the number of distinct codes that they define is at most 




(47.30) 



which is much smaller than A/i, so, by the pigeon-hole principle, it is not log A/2 < NklogN 
possible for every random linear code to map on to a low-density H. 
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48 



Convolutional Codes and Turbo Codes 



This chapter follows tightly on from Chapter 25. It makes use of the ideas of 
codes and trellises and the forward-backward algorithm. 

► 48.1 Introduction to convolutional codes 

When we studied linear block codes, we described them in three ways: 

1. The generator matrix describes how to turn a string of K arbitrary 
source bits into a transmission of N bits. 

2. The parity-check matrix specifies the M = N — K parity-check con- 
straints that a valid codeword satisfies. 

3. The trellis of the code describes its valid codewords in terms of paths 
through a trellis with labelled edges. 

A fourth way of describing some block codes, the algebraic approach, is not 
covered in this book (a) because it has been well covered by numerous other 
books in coding theory; (b) because, as this part of the book discusses, the 
state of the art in error-correcting codes makes little use of algebraic coding 
theory; and (c) because I am not competent to teach this subject. 

We will now describe convolutional codes in two ways: first, in terms of 
mechanisms for generating transmissions t from source bits s; and second, in 
terms of trellises that describe the constraints satisfied by valid transmissions. 

► 48.2 Linear-feedback shift-registers 

We generate a transmission with a convolutional code by putting a source 
stream through a linear filter. This filter makes use of a shift register, linear 
output functions, and, possibly, linear feedback. 

I will draw the shift-register in a right-to-left orientation: bits roll from 
right to left as time goes on. 

Figure 48.1 shows three linear- feedback shift-registers which could be used 
to define convolutional codes. The rectangular box surrounding the bits 
z\ ... z-j indicate the memory of the filter, also known as its state. All three 
filters have one input and two outputs. On each clock cycle, the source sup- 
plies one bit, and the filter outputs two bits v- a > and t^ b \ By concatenating 
together these bits we can obtain from our source stream S1S2S3 ... a trans- 
mission stream if" 4 4 4 • ■ ■ ■ Because there are two transmitted bits 
for every source bit, the codes shown in figure 48.1 have rate V2. Because 
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Figure 48.1. Linear-feedback 
shift-registers for generating 
convolutional codes with rate 1/2. 
The symbol (HI indicates a 
copying with a delay of one clock 
cycle. The symbol denotes 
linear addition modulo 2 with no 
delay. 



4" 



-8 — -8- 
-t T"H 1 



27 (75J 26 (75] «5 (75J 2:4 (75J 2 3 (75J 2 2 (75J 2l (75] Zo 

— h-l — i — t 1 



fib) 



(c) 



I p t(«) 

\ L ' 371/8 



these filters require fc = 7 bits of memory, the codes they define are known as 
a constraint-length 7 codes. 

Convolutional codes come in three flavours, corresponding to the three 
types of filter in figure 48.1. 



Systematic nonrecursive 

The filter shown in figure 48.1a has no feedback. It also has the property that 
one of the output bits, t^ a \ is identical to the source bit s. This encoder is 
thus called systematic, because the source bits are reproduced transparently 
in the transmitted stream, and nonrecursive, because it has no feedback. The 
other transmitted bit t^ is a linear function of the state of the filter. One 
way of describing that function is as a dot product (modulo 2) between two 
binary vectors of length k + 1: a binary vector = (1,1, 1,0, 1,0, 1,1) and 
the state vector z = (zk,Zk-i, ... ,z\, zq). We include in the state vector the 

bit zo that will be put into the first bit of the memory on the next cycle. The 11101011 
vector g( b ) has = 1 for every k where there is a tap (a downward pointing III 
arrow) from state bit z K into the transmitted bit t^ b \ 3 5 3 

A convenient way to describe these binary tap vectors is in octal. Thus, Table 48.2. How taps in the delay 
this filter makes use of the tap vector 353s- I have drawn the delay lines from line are converted to octal, 
right to left to make it easy to relate the diagrams to these octal numbers. 



Nonsystematic nonrecursive 

The filter shown in figure 48.1b also has no feedback, but it is not systematic. 
It makes use of two tap vectors g' a ' and to create its two transmitted bits. 
This encoder is thus nonsystematic and nonrecursive. Because of their added 
complexity, nonsystematic codes can have error-correcting abilities superior to 
those of systematic nonrecursive codes with the same constraint length. 
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Systematic recursive 



The filter shown in figure 48.1c is similar to the nonsystematic nonrecursive 
filter shown in figure 48.1b, but it uses the taps that formerly made up g*- a ' 
to make a linear signal that is fed back into the shift register along with the 
source bit. The output t^ is a linear function of the state vector as before. 
The other output is t^ = s, so this filter is systematic. 

A recursive code is conventionally identified by an octal ratio, e.g., fig- 
ure 48.1c's code is denoted by (247/371) 8 . 

Equivalence of systematic recursive and nonsystematic nonrecursive codes 

The two filters in figure 48.1b,c are code-equivalent in that the sets of code- 
words that they define are identical. For every codeword of the nonsystematic 
nonrecursive code we can choose a source stream for the other encoder such 
that its output is identical (and vice versa). 

To prove this, we denote by p the quantity X^k=i 9k z ki as shown in fig- 
ure 48.3a and b, which shows a pair of smaller but otherwise equivalent filters. 
If the two transmissions are to be equivalent - that is, the t^s are equal in 
both figures and so are the t^ b 'S - then on every cycle the source bit in the 
systematic code must be s = t^ a \ So now we must simply confirm that for 
this choice of s, the systematic code's shift register will follow the same state 
sequence as that of the nonsystematic code, assuming that the states match 
initially. In figure 48.3a we have 



t («) 



pi 



nonrecursive 



whereas in figure 48.3b we have 

^recursive = t (a) & p 

Substituting for (W, and using p (Bp = we immediately find 



(48.1) 



(48.2) 
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Z2 (5] «1 (5\ Zq- 
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Figure 48.3. Two ratc-1/2 
convolutional codes with 
constraint length k = 2: 
(a) non-recursive; (b) recursive. 
The two codes are equivalent. 



recursive nonrecursive 



(48.3) 



Thus, any codeword of a nonsystematic nonrecursive code is a codeword of 
a systematic recursive code with the same taps - the same taps in the sense 
that there are vertical arrows in all the same places in figures 48.3(a) and (b), 
though one of the arrows points up instead of down in (b). 

Now, while these two codes are equivalent, the two encoders behave dif- 
ferently. The nonrecursive encoder has a finite impulse response, that is, if 
one puts in a string that is all zeroes except for a single one, the resulting 
output stream contains a finite number of ones. Once the one bit has passed 
through all the states of the memory, the delay line returns to the all-zero 
state. Figure 48.4a shows the state sequence resulting from the source string 
s =(0, 0, 1, 0, 0, 0, 0, 0). 

Figure 48.4b shows the trellis of the recursive code of figure 48.3b and the 
response of this filter to the same source string s =(0, 0, 1, 0, 0, 0, 0, 0). The 
filter has an infinite impulse response. The response settles into a periodic 
state with period equal to three clock cycles. 

> Exercise 48.1.^ What is the input to the recursive filter such that its state 
sequence and the transmission are the same as those of the nonrecursive 
filter? (Hint: see figure 48.5.) 
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(a) 



(b) 




10 



transmit 0000111011000000 
source 1 




□ ■■□ ■■ 



transmit 0000110101000101 
source 1 



Figure 48.4. Trellises of the 
rate-1/2 convolutional codes of 
figure 48.3. It is assumed that the 
initial state of the filter is 
(22, zi) = (0, 0). Time is on the 
horizontal axis and the state of 
the filter at each time step is the 
vertical coordinate. On the line 
segments are shown the emitted 
symbols t 1 -^ and , with stars 
for 'I' and boxes for '0'. The 
paths taken through the trellises 
when the source sequence is 
00100000 are highlighted with a 
solid line. The light dotted lines 
show the state trajectories that 
are possible for other source 
sequences. 



■ ■□■ ■•■ -X □■ ■•■ ■ ■* □■ ■ 




transmit 0000111011000000 
source 1 1 1 



Figure 48.5. The source sequence 
for the systematic recursive code 
00111000 produces the same path 
through the trellis as 00100000 
does in the nonsystematic 



nonrecursive case. 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 

578 48 — Convolutional Codes and Turbo Codes 



1 t 

Z4 (d] z 3 (d\ z 2 (d] z\ (d] z 

— t — i — H t 



t(*> 



mi 

1110 
1101 
1100 
1011 
1010 
1001 
1000 
0111 
0110 
0101 
0100 
0011 
0010 
0001 
0000 - 
received 



r 



(21/37) s 




Figure 48.6. The trellis for a k = 4 
code painted with the likelihood 
function when the received vector 
is equal to a codeword with just 
one bit flipped. There are three 
line styles, depending on the value 
of the likelihood: thick solid lines 
show the edges in the trellis that 
match the corresponding two bits 
of the received string exactly; 
thick dotted lines show edges that 
match one bit but mismatch the 
other; and thin dotted lines show 
the edges that mismatch both 
bits. 
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1 



In general a linear-feedback shift-register with k bits of memory has an impulse 
response that is periodic with a period that is at most 2 k — 1, corresponding 
to the filter visiting every non-zero state in its state space. 

Incidentally cheap pseudorandom number generators and cheap crypto- 
graphic products make use of exactly these periodic sequences, though with 
larger values of k than 7; the random number seed or cryptographic key se- 
lects the initial state of the memory. There is thus a close connection between 
certain cryptanalysis problems and the decoding of convolutional codes. 



► 48.3 Decoding convolutional codes 

The receiver receives a bit stream, and wishes to infer the state sequence 
and thence the source stream. The posterior probability of each bit can be 
found by the sum-product algorithm (also known as the forward-backward or 
BCJR algorithm), which was introduced in section 25.3. The most probable 
state sequence can be found using the min-sum algorithm of section 25.3 
(also known as the Viterbi algorithm). The nature of this task is illustrated 
in figure 48.6, which shows the cost associated with each edge in the trellis 
for the case of a sixteen-state code; the channel is assumed to be a binary 
symmetric channel and the received vector is equal to a codeword except that 
one bit has been flipped. There are three line styles, depending on the value 
of the likelihood: thick solid lines show the edges in the trellis that match the 
corresponding two bits of the received string exactly; thick dotted lines show 
edges that match one bit but mismatch the other; and thin dotted lines show 
the edges that mismatch both bits. The min-sum algorithm seeks the path 
through the trellis that uses as many solid lines as possible; more precisely, it 
minimizes the cost of the path, where the cost is zero for a solid line, one for 
a thick dotted line, and two for a thin dotted line. 

> Exercise 48.2.^' P- 581 l Can you spot the most probable path and the flipped 
bit? 
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Figure 48.7. Two paths that differ 
in two transmitted bits only. 
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Figure 48.8. A terminated trellis. 
When any codeword is completed, 
the filter state is 0000. 



Unequal protection 

A defect of the convolutional codes presented thus far is that they offer un- 
equal protection to the source bits. Figure 48.7 shows two paths through the 
trellis that differ in only two transmitted bits. The last source bit is less well 
protected than the other source bits. This unequal protection of bits motivates 
the termination of the trellis. 

A terminated trellis is shown in figure 48.8. Termination slightly reduces 
the number of source bits used per codeword. Here, four source bits are turned 
into parity bits because the k = 4 memory bits must be returned to zero. 

► 48.4 Turbo codes 

An (N,K) turbo code is defined by a number of constituent convolutional 
encoders (often, two) and an equal number of interleavers which are K x K 
permutation matrices. Without loss of generality, we take the first interleaver 
to be the identity matrix. A string of K source bits is encoded by feeding them 
into each constituent encoder in the order defined by the associated interleaver, 
and transmitting the bits that come out of each constituent encoder. Often 
the first constituent encoder is chosen to be a systematic encoder, just like the 
recursive filter shown in figure 48.6, and the second is a non-systematic one of 
rate 1 that emits parity bits only. The transmitted codeword then consists of 



L0 



c 2 - 



Figure 48.10. The encoder of a 
turbo code. Each box Ci, C 2 , 
contains a convolutional code. 
The source bits are reordered 
using a permutation n before they 
are fed to C 2 - The transmitted 
codeword is obtained by 
concatenating or interleaving the 
outputs of the two convolutional 
codes. 
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Figure 48.9. Rate-1/3 (a) and rate-1/2 (b) turbo codes represented as factor graphs. The circles 
represent the codeword bits. The two rectangles represent trellises of rate-1 /2 convolutional 
codes, with the systematic bits occupying the left half of the rectangle and the parity bits 
occupying the right half. The puncturing of these constituent codes in the rate-1/2 turbo 
code is represented by the lack of connections to half of the parity bits in each trellis. 



K source bits followed by Mi parity bits generated by the first convolutional 
code and M2 parity bits from the second. The resulting turbo code has rate 
Vs. 

The turbo code can be represented by a factor graph in which the two 
trellises are represented by two large rectangular nodes (figure 48.9a); the K 
source bits and the first M\ parity bits participate in the first trellis and the K 
source bits and the last M2 parity bits participate in the second trellis. Each 
codeword bit participates in either one or two trellises, depending on whether 
it is a parity bit or a source bit. Each trellis node contains a trellis exactly like 
the terminated trellis shown in figure 48.8, except one thousand times as long. 
[There are other factor graph representations for turbo codes that make use 
of more elementary nodes, but the factor graph given here yields the standard 
version of the sum-product algorithm used for turbo codes.] 

If a turbo code of smaller rate such as V2 is required, a standard modifica- 
tion to the rate-V3 code is to puncture some of the parity bits (figure 48.9b). 

Turbo codes are decoded using the sum-product algorithm described in 
Chapter 26. On the first iteration, each trellis receives the channel likelihoods, 
and runs the forward-backward algorithm to compute, for each bit, the relative 
likelihood of its being 1 or 0, given the information about the other bits. 
These likelihoods are then passed across from each trellis to the other, and 
multiplied by the channel likelihoods on the way. We are then ready for the 
second iteration: the forward-backward algorithm is run again in each trellis 
using the updated probabilities. After about ten or twenty such iterations, it's 
hoped that the correct decoding will be found. It is common practice to stop 
after some fixed number of iterations, but we can do better. 

As a stopping criterion, the following procedure can be used at every iter- 
ation. For each time-step in each trellis, we identify the most probable edge, 
according to the local messages. If these most probable edges join up into two 
valid paths, one in each trellis, and if these two paths are consistent with each 
other, it is reasonable to stop, as subsequent iterations are unlikely to take 
the decoder away from this codeword. If a maximum number of iterations is 
reached without this stopping criterion being satisfied, a decoding error can 
be reported. This stopping procedure is recommended for several reasons: it 
allows a big saving in decoding time with no loss in error probability; it allows 
decoding failures that are detected by the decoder to be so identified - knowing 
that a particular block is definitely corrupted is surely useful information for 
the receiver! And when we distinguish between detected and undetected er- 
rors, the undetected errors give helpful insights into the low weight codewords 
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of the code, which may improve the process of code design. 

Turbo codes as described here have excellent performance down to decoded 
error probabilities of about 1CT 5 , but randomly-constructed turbo codes tend 
to have an error Boor starting at that level. This error floor is caused by low- 
weight codewords. To reduce the height of the error floor, one can attempt 
to modify the random construction to increase the weight of these low-weight 
codewords. The tweaking of turbo codes is a black art, and it never succeeds 
in totalling eliminating low-weight codewords; more precisely, the low-weight 
codewords can only be eliminated by sacrificing the turbo code's excellent 
performance. In contrast, low-density parity-check codes rarely have error 
floors. 

► 48.5 Parity-check matrices of convolutional codes and turbo codes 

We close by discussing the parity-check matrix of a rate- V2 convolutional code 
viewed as a linear block code. We adopt the convention that the TV bits of one 
block are made up of the N/2 bits t^ followed by the N/2 bits t {b \ 

> Exercise 48.3. ^ Prove that a convolutional code has a low-density parity- 
check matrix as shown schematically in figure 48.11a. 

Hint: It's easiest to figure out the parity constraints satisfied by a convo- 
lutional code by thinking about the nonsystematic nonrecursive encoder 
(figure 48.1b). Consider putting through filter a a stream that's been 
through convolutional filter b, and vice versa; compare the two resulting 
streams. Ignore termination of the trellises. 

The parity-check matrix of a turbo code can be written down by listing the 
constraints satisfied by the two constituent trellises (figure 48.11b). So turbo 
codes are also special cases of low-density parity-check codes. If a turbo code 
is punctured, it no longer necessarily has a low-density parity-check matrix, 
but it always has a generalized parity-check matrix that is sparse, as explained 
in the next chapter. 



(a) 




(b) 



Figure 48.11. Schematic pictures 
of the parity-check matrices of (a) 
a convolutional code, rate 1/2, 
and (b) a turbo code, rate 1/3. 
Notation: A diagonal line 
represents an identity matrix. A 
band of diagonal lines represent a 
band of diagonal Is. A circle 
inside a square represents the 
random permutation of all the 
columns in that square. A number 
inside a square represents the 
number of random permutation 
matrices superposed in that 
square. Horizontal and vertical 
lines indicate the boundaries of 
the blocks within the matrix. 



Further reading 

For further reading about convolutional codes, Johannesson and Zigangirov 
(1999) is highly recommended. One topic I would have liked to include is 
sequential decoding. Sequential decoding explores only the most promising 
paths in the trellis, and backtracks when evidence accumulates that a wrong 
turning has been taken. Sequential decoding is used when the trellis is too 
big for us to be able to apply the maximum likelihood algorithm, the min- 
sum algorithm. You can read about sequential decoding in Johannesson and 
Zigangirov (1999). 

For further information about the use of the sum-product algorithm in 
turbo codes, and the rarely-used but highly recommended stopping criteria 
for halting their decoding, Frey (1998) is essential reading. (And there's lots 
more good stuff in the same book!) 



► 48.6 Solutions 



Solution to exercise 48.2 (p. 578). The first bit was flipped. The most probable 
path is the upper one in figure 48.7. 
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49 



Repeat-Accumulate Codes 



In Chapter 1 we discussed a very simple and not very effective method for 
communicating over a noisy channel: the repetition code. We now discuss a 
code that is almost as simple, and whose performance is outstandingly good. 

Repeat-accumulate codes were studied by Divsalar et al. (1998) for theo- 
retical purposes, as simple turbo-like codes that might be more amenable to 
analysis than messy turbo codes. Their practical performance turned out to 
be just as good as other sparse-graph codes. 



49.1 


The encoder 




1 


Take K source bits. 

SlS 2 S 3 ...SK 




2 


Repeat each bit three times, giving N = 3K bits. 

S1S1S1S2S2S2S3S3S3 . . . s K s K s K 




3 


Permute these N bits using a random permutation (a fixed random 
permutation - the same one for every codeword). Call the permuted 
string u. 


4 


Transmit the accumulated sum. 






h = "I 

t 2 = t\ + U2 (mod 2) 
••• t n = t n -\ + u n (mod 2) ... 
t;v = i/v-i + u n (mod 2). 


(49.1) 


5 


That's it! 




49.2 


Graph 




Figure 


49.1a shows the graph of a repeat-accumulate code, 


using four types 



of node: equality constraints E, intermediate binary variables (black circles), 
parity constraints 0, and the transmitted bits (white circles). 

The source sets the values of the black bits at the bottom, three at a time, 
and the accumulator computes the transmitted bits along the top. 
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583 






(b) 



Figure 49.1. Factor graphs for a 
repeat-accumulate code with rate 
1/3. (a) Using elementary nodes. 
Each white circle represents a 
transmitted bit. Each 
constraint forces the sum of the 3 
bits to which it is connected to be 
even. Each black circle represents 
an intermediate binary variable. 
Each El constraint forces the three 
variables to which it is connected 
to be equal. 

(b) Factor graph normally used 
for decoding. The top rectangle 
represents the trellis of the 
accumulator, shown in the inset. 




Figure 49.2. Performance of six 
rate-V3 repeat-accumulate codes 
on the Gaussian channel. The 
blocklcngths range from N = 204 
to N = 30 000. Vertical axis: 
block error probability; horizontal 
axis: Eb/N . The dotted lines 
show the frequency of undetected 



1e-05 



This graph is a factor graph for the prior probability over codewords, 
with the circles being binary variable nodes, and the squares representing 
two types of factor nodes. As usual, each contributes a factor of the form 
lEa; = 0mod2]; each B contributes a factor of the form t[xi = x 2 = x^\. 



► 49.3 Decoding 

The repeat-accumulate code is normally decoded using the sum-product algo- 
rithm on the factor graph depicted in figure 49.1b. The top box represents the 
trellis of the accumulator, including the channel likelihoods. In the first half 
of each iteration, the top trellis receives likelihoods for every transition in the 
trellis, and runs the forward-backward algorithm so as to produce likelihoods 
for each variable node. In the second half of the iteration, these likelihoods 
are multiplied together at the E nodes to produce new likelihood messages to 
send back to the trellis. 

As with Gallager codes and turbo codes, the stop-when-it's-done decoding 
method can be applied, so it is possible to distinguish between undetected 
errors (which are caused by low-weight codewords in the code) and detected 
errors (where the decoder gets stuck and knows that it has failed to find a 
valid answer). 

Figure 49.2 shows the performance of six randomly-constructed repeat- 
accumulate codes on the Gaussian channel. If one does not mind the error 
floor which kicks in at about a block error probability of 10 ~ 4 , the performance 
is staggeringly good for such a simple code (cf. figure 47.17). 
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total 




detected — e— 




\ undetected 












^ O 



0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 

(a) 




20 40 60 80 100 120 140 16 

(ii.b) E b /N = 0.749 dB 



20 40 60 80 100 120 140 160 180 

(ii.c) E b /N = 0.846 dB 



30 40 50 



(iii.b) 



40 50 60 70 80 90100 



► 49.4 Empirical distribution of decoding times 

It is interesting to study the number of iterations r of the sum-product algo- 
rithm required to decode a sparse-graph code. Given one code and a set of 
channel conditions, the decoding time varies randomly from trial to trial. We 
find that the histogram of decoding times follows a power law, P(t) oc t~ p , 
for large r. The power p depends on the signal-to-noise ratio and becomes 
smaller (so that the distribution is more heavy-tailed) as the signal-to-noise 
ratio decreases. We have observed power laws in rep eat- accumulate codes 
and in irregular and regular Gallager codes. Figures 49.3(h) and (iii) show the 
distribution of decoding times of a repeat-accumulate code at two different 
signal-to-noise ratios. The power laws extend over several orders of magnitude. 

Exercise 49. 1.^ Investigate these power laws. Does density evolution predict 
them? Can the design of a code be used to manipulate the power law in 
a useful way? 



(iii.c) 

Figure 49.3. Histograms of 
number of iterations to find a 
valid decoding for a 
repeat-accumulate code with 
source block length K = 10 000 
and transmitted blocklength 
N = 30 000. (a) Block error- 
probability versus signal-to-noise 
ratio for the RA code, (ii.b) 
Histogram for x/cr = 0.89, 
E b /N = 0.749 dB. (ii.c) 
x/o = 0.90, E b /N = 0.846 dB. 
(iii.b, iii.c) Fits of power laws to 
(ii.b) (1/r 6 ) and (ii.c) (1/r 9 ). 



► 49.5 Generalized parity-check matrices 

I find that it is helpful when relating sparse-graph codes to each other to use 
a common representation for them all. Forney (2001) introduced the idea of 
a normal graph in which the only nodes are and El and all variable nodes 
have degree one or two; variable nodes with degree two can be represented on 
edges that connect a node to a El node. The generalized parity-check matrix 
is a graphical way of representing normal graphs. In a parity-check matrix, 
the columns are transmitted bits, and the rows are linear constraints. In a 
generalized parity-check matrix, additional columns may be included, which 
represent state variables that are not transmitted. One way of thinking of these 
state variables is that they are punctured from the code before transmission. 

State variables are indicated by a horizontal line above the corresponding 
columns. The other pieces of diagrammatic notation for generalized parity- 
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H 
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Figure 49.4. The generator 
matrix, parity-check matrix, and a 
generalized parity-check matrix of 
a repetition code with rate V3. 



{A,p} = 




check matrices are, as in (MacKay, 1999b; MacKay et at, 1998): 

• A diagonal line in a square indicates that that part of the matrix contains 
an identity matrix. 

• Two or more parallel diagonal lines indicate a band-diagonal matrix with 
a corresponding number of Is per row. 

• A horizontal ellipse with an arrow on it indicates that the corresponding 
columns in a block are randomly permuted. 

• A vertical ellipse with an arrow on it indicates that the corresponding 
rows in a block are randomly permuted. 

• An integer surrounded by a circle represents that number of superposed 
random permutation matrices. 

Definition. A generalized parity-check matrix is a pair {A,p}, where A is a 
binary matrix and p is a list of the punctured bits. The matrix defines a set 
of valid vectors x, satisfying 

Ax = 0; (49.2) 

for each valid vector there is a codeword t(x) that is obtained by puncturing 
from x the bits indicated by p. For any one code there are many generalized 
parity-check matrices. 

The rate of a code with generalized parity-check matrix {A,p} can be 
estimated as follows. If A is L x M', and p punctures S bits and selects N 
bits for transmission (L = N + S), then the effective number of constraints on 
the codeword, M, is 

M = M' -S, (49.3) 

the number of source bits is 

K = N — M = L — M', (49.4) 
and the rate is greater than or equal to 

M M'-S , 

R = 1 ~N= 1 -—- ^ 
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3 



H 



3 
3 



Figure 49.5. The generator matrix 
and parity-check matrix of a 
systematic low-density 
generator-matrix code. The code 
has rate Vz. 



G T 



3 
3 



A,p 



3 
3 



Figure 49.6. The generator matrix 
and generalized parity-check 
matrix of a non- systematic 
low-density generator-matrix 
code. The code has rate V2. 



Examples 

Repetition code. The generator matrix, parity-check matrix, and generalized 
parity-check matrix of a simple rate-V3 repetition code are shown in figure 49.4. 



Systematic low-density generator-matrix code. In an (N,K) systematic low- 
density generator-matrix code, there are no state variables. A transmitted 
codeword t of length N is given by 



t = G T s, 



where 



G T 



Ik 
P 



(49.6) 



(49.7) 



with Ik denoting the K x K identity matrix, and P being a very sparse M x K 
matrix, where M = N — K . The parity-check matrix of this code is 



H = fP|I 



(49.8) 



In the case of a rate-V3 code, this parity-check matrix might be represented 
as shown in figure 49.5. 

Non-systematic low-density generator-matrix code. In an (N,K) non-systematic 
low-density generator-matrix code, a transmitted codeword t of length N is 
given by 

t = G T s, (49.9) 

where G T is a very sparse N x K matrix. The generalized parity-check matrix 
of this code is 

A = [CTlIjv] , (49.10) 
and the corresponding generalized parity-check equation is 



Ax = 0, where x 



(49.11) 



Whereas the parity-check matrix of this simple code is typically a com- 
plex, dense matrix, the generalized parity-check matrix retains the underlying 
simplicity of the code. 

In the case of a rate-V2 code, this generalized parity-check matrix might 
be represented as shown in figure 49.6. 

Low-density parity-check codes and linear MN codes. The parity-check matrix 
of a rate-1/3 low-density parity-check code is shown in figure 49.7a. 



(a) 



(b) 



Figure 49.7. The generalized 
parity-check matrices of (a) a 
ratc-V3 Gallager code with M/2 
columns of weight 2; (b) a rate-V2 
linear MN code. 
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49.5: Generalized parity-check matrices 

A linear MN code is a non-systematic low-density parity-check code. The 
K state bits of an MN code are the source bits. Figure 49.7b shows the 
generalized parity-check matrix of a rate- 1/2 linear MN code. 

Convolutional codes. In a non-systematic, non-recursive convolutional code, 
the source bits, which play the role of state bits, are fed into a delay-line and 
two linear functions of the delay-line are transmitted. In figure 49.8a, these 
two parity streams are shown as two successive vectors of length K. [It is 
common to interleave these two parity streams, a bit-reordering that is not 
relevant here, and is not illustrated.] 

Concatenation. 'Parallel concatenation' of two codes is represented in one of 
these diagrams by aligning the matrices of two codes in such a way that the 
'source bits' line up, and by adding blocks of zero-entries to the matrix such 
that the state bits and parity bits of the two codes occupy separate columns. 
An example is given by the turbo code below. In 'serial concatenation', the 
columns corresponding to the transmitted bits of the first code are aligned 
with the columns corresponding to the source bits of the second code. 

Turbo codes. A turbo code is the parallel concatenation of two convolutional 
codes. The generalized parity-check matrix of a rate-1/3 turbo code is shown 
in figure 49.8b. 

Repeat-accumulate codes. The generalized parity-check matrices of a rate-1/3 
repeat-accumulate code is shown in figure 49.9. Repeat-accumulate codes are 
equivalent to staircase codes (section 47.7, p. 569). 

Intersection. The generalized parity-check matrix of the intersection of two 
codes is made by stacking their generalized parity-check matrices on top of 
each other in such a way that all the transmitted bits' columns are correctly 
aligned, and any punctured bits associated with the two component codes 
occupy separate columns. 
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(b) 

Figure 49.8. The generalized 
parity-check matrices of (a) a 
convolutional code with rate V2. 
(b) a rate-V3 turbo code built by 
parallel concatenation of two 
convolutional codes. 




Figure 49.9. The generalized 
parity-check matrix of a 
repeat-accumulate code with rate 

Vs. 
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About Chapter 50 



The following exercise provides a helpful background for digital fountain codes. 

> Exercise 50.1.^ An author proofreads his K = 700-page book by inspecting 
random pages. He makes N page-inspections, and does not take any 
precautions to avoid inspecting the same page twice. 

(a) After N = K page-inspections, what fraction of pages do you expect 
have never been inspected? 

(b) After N > K page-inspections, what is the probability that one or 
more pages have never been inspected? 

(c) Show that in order for the probability that all K pages have been 
inspected to be 1 — 5, we require N ~ Kh\{K/5) page- inspections. 

[This problem is commonly presented in terms of throwing N balls at 
random into K bins; what's the probability that every bin gets at least 
one ball?] 
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50 



Digital Fountain Codes 



Digital fountain codes are record-breaking sparse-graph codes for channels 
with erasures. 

Channels with erasures are of great importance. For example, files sent 
over the internet are chopped into packets, and each packet is either received 
without error or not received. A simple channel model describing this situation 
is a g-ary erasure channel, which has (for all inputs in the input alphabet 
{0, 1,2, ... , q— 1}) a probability 1 — / of transmitting the input without error, 
and probability / of delivering the output '?'. The alphabet size q is 2 l , where 
I is the number of bits in a packet. 

Common methods for communicating over such channels employ a feed- 
back channel from receiver to sender that is used to control the retransmission 
of erased packets. For example, the receiver might send back messages that 
identify the missing packets, which are then retransmitted. Alternatively, the 
receiver might send back messages that acknowledge each received packet; the 
sender keeps track of which packets have been acknowledged and retransmits 
the others until all packets have been acknowledged. 

These simple retransmission protocols have the advantage that they will 
work regardless of the erasure probability /, but purists who have learned their 
Shannon theory will feel that these retransmission protocols are wasteful. If 
the erasure probability / is large, the number of feedback messages sent by 
the first protocol will be large. Under the second protocol, it's likely that the 
receiver will end up receiving multiple redundant copies of some packets, and 
heavy use is made of the feedback channel. According to Shannon, there is no 
need for the feedback channel: the capacity of the forward channel is (1 — f)l 
bits, whether or not we have feedback. 

The wastefulness of the simple retransmission protocols is especially evi- 
dent in the case of a broadcast channel with erasures - channels where one 
sender broadcasts to many receivers, and each receiver receives a random 
fraction (1 — /) of the packets. If every packet that is missed by one or more 
receivers has to be retransmitted, those retransmissions will be terribly re- 
dundant. Every receiver will have already received most of the retransmitted 
packets. 

So, we would like to make erasure-correcting codes that require no feed- 
back or almost no feedback. The classic block codes for erasure correction are 
called Reed-Solomon codes. An (N, K) Reed-Solomon code (over an alpha- 
bet of size q = 2 l ) has the ideal property that if any K of the N transmitted 
symbols are received then the original K source symbols can be recovered. 
[See Berlekamp (1968) or Lin and Costello (1983) for further information; 
Reed-Solomon codes exist for TV < q.] But Reed-Solomon codes have the 
disadvantage that they are practical only for small K, N, and q: standard im- 
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plementations of encoding and decoding have a cost of order K(N—K) log 2 N 
packet operations. Furthermore, with a Reed-Solomon code, as with any block 
code, one must estimate the erasure probability / and choose the code rate 
R = K/N before transmission. If we are unlucky and / is larger than expected 
and the receiver receives fewer than K symbols, what are we to do? We'd like 
a simple way to extend the code on the fly to create a lower-rate (N r , K) code. 
For Reed-Solomon codes, no such on-the-fly method exists. 

There is a better way, pioneered by Michael Luby (2002) at his company 
Digital Fountain, the first company whose business is based on sparse-graph 
codes. 

The digital fountain codes I describe here, LT codes, were invented by 
Luby in 1998. The idea of a digital fountain code is as follows. The encoder is LT stands for 'Luby transform', 
a fountain that produces an endless supply of water drops (encoded packets); 
let's say the original source file has a size of Kl bits, and each drop contains 
I encoded bits. Now, anyone who wishes to receive the encoded file holds a 
bucket under the fountain and collects drops until the number of drops in the 
bucket is a little larger than K. They can then recover the original file. 

Digital fountain codes are rateless in the sense that the number of encoded 
packets that can be generated from the source message is potentially limitless; 
and the number of encoded packets generated can be determined on the fly. 
Regardless of the statistics of the erasure events on the channel, we can send 
as many encoded packets as are needed in order for the decoder to recover 
the source data. The source data can be decoded from any set of K' encoded 
packets, for K' slightly larger than K (in practice, about 5% larger). 

Digital fountain codes also have fantastically small encoding and decod- 
ing complexities. With probability 1 — 8, K packets can be communicated 
with average encoding and decoding costs both of order Kln(K/S) packet 
operations. 

Luby calls these codes universal because they are simultaneously near- 
optimal for every erasure channel, and they are very efficient as the file length 
K grows. The overhead K' — K is of order \fK(\n{K/ &)) 2 . 



► 50.1 A digital fountain's encoder 



Each encoded packet t n is produced from the source file S1S2S3 . . .sk as 
follows: 

1. Randomly choose the deg ree d n of the packet from a degree distri- 
bution p(d); the appropriate choice of p depends on the source file 
size K, as we'll discuss later. 

2. Choose, uniformly at random, d n distinct input packets, and set t n 
equal to the bitwise sum, modulo 2 of those d n packets. This sum 
can be done by successively exclusive-or-ing the packets together. 



This encoding operation defines a graph connecting encoded packets to 
source packets. If the mean degree d is significantly smaller than K then the 
graph is sparse. We can think of the resulting code as an irregular low-density 
generator-matrix code. 

The decoder needs to know the degree of each packet that is received, and 
which source packets it is connected to in the graph. This information can 
be communicated to the decoder in various ways. For example, if the sender 
and receiver have synchronized clocks, they could use identical pseudo-random 
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number generators, seeded by the clock, to choose each random degree and 
each set of connections. Alternatively, the sender could pick a random key, 
K n , given which the degree and the connections are determined by a pseudo- 
random process, and send that key in the header of the packet. As long as the 
packet size I is much bigger than the key size (which need only be 32 bits or 
so), this key introduces only a small overhead cost. 

► 50.2 The decoder 

Decoding a sparse-graph code is especially easy in the case of an erasure chan- 
nel. The decoder's task is to recover s from t = Gs, where G is the matrix 
associated with the graph. The simple way to attempt to solve this prob- 
lem is by message-passing. We can think of the decoding algorithm as the 
sum-product algorithm if we wish, but all messages are either completely un- 
certain messages or completely certain messages. Uncertain messages assert 
that a message packet s k could have any value, with equal probability; certain 
messages assert that s k has a particular value, with probability one. 

This simplicity of the messages allows a simple description of the decoding 
process. We'll call the encoded packets {t n } check nodes. 



1. Find a check node t n that is connected to only one source packet 
Sfe. (If there is no such check node, this decoding algorithm halts at 
this point, and fails to recover all the source packets.) 

(a) Set s k = t n . 

(b) Add Sfc to all checks t n i that are connected to sj.: 

t n ' '■= tn' + s k for ah n ' suc h that G n >k = 1. (50.1) 

(c) Remove all the edges connected to the source packet s k - 

2. Repeat (1) until all {s k } are determined. 



This decoding process is illustrated in figure 50.1 for a toy case where each 
packet is just one bit. There are three source packets (shown by the upper 
circles) and four received packets (shown by the lower check symbols), which 
have the values = 1011 at the start of the algorithm. 

At the first iteration, the only check node that is connected to a sole source 
bit is the first check node (panel a). We set that source bit si accordingly 
(panel b), discard the check node, then add the value of si (1) to the checks to 
which it is connected (panel c), disconnecting si from the graph. At the start 
of the second iteration (panel c), the fourth check node is connected to a sole 
source bit, S2- We set S2 to t\ (0, in panel d), and add S2 to the two checks 
it is connected to (panel e). Finally, we find that two check nodes are both 
connected to S3, and they agree about the value of S3 (as we would hope!), 
which is restored in panel f. 

► 50.3 Designing the degree distribution 

The probability distribution p(d) of the degree is a critical part of the design: 
occasional encoded packets must have high degree (i.e., d similar to K) in 
order to ensure that there are not some source packets that are connected to 
no-one. Many packets must have low degree, so that the decoding process 




1 1 



10 1 

f) o o o 



Figure 50.1. Example decoding for 
a digital fountain code with 
K = 3 source bits and N = 4 
encoded bits. 
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can get started, and keep going, and so that the total number of addition 
operations involved in the encoding and decoding is kept small. For a given 
degree distribution p(d), the statistics of the decoding process can be predicted 
by an appropriate version of density evolution. 

Ideally, to avoid redundancy, we'd like the received graph to have the prop- 
erty that just one check node has degree one at each iteration. At each itera- 
tion, when this check node is processed, the degrees in the graph are reduced 
in such a way that one new degree-one check node appears. In expectation, 
this ideal behaviour is achieved by the idea] soliton distribution, 



p(d) 



1/K 
l 



for d= 2,3, 



(50.2) 



The expected degree under this distribution is roughly In K. 

t> Exercise 50.2. ^ Derive the ideal soliton distribution. At the first iteration 
(t = 0) let the number of packets of degree d be h (d); show that (for 
d > 1) the expected number of packets of degree d that have their degree 
reduced to d — 1 is ho(d)d/K; and at the tth iteration, when t of the 
K packets have been recovered and the number of packets of degree d 
is ht(d), the expected number of packets of degree d that have their 
degree reduced to d — 1 is ht(d)d/ K — t. Hence show that in order to 
have the expected number of packets of degree 1 satisfy h t (l) = 1 for all 
t€ {0, ... K— 1}, we must to start with have ho(l) = 1 and ho(2) = K/2; 
and more generally, h t (2) = (K — t)/2; then by recursion solve for ho(d) 
for d = 3 upwards. 

This degree distribution works poorly in practice, because fluctuations 
around the expected behaviour make it very likely that at some point in the 
decoding process there will be no degree-one check nodes; and, furthermore, a 
few source nodes will receive no connections at all. A small modification fixes 
these problems. 

The robust soliton distribution has two extra parameters, c and 5; it is 
designed to ensure that the expected number of degree-one checks is about 

S = c]n(K/6)VK, (50.3) 

rather than 1, throughout the decoding process. The parameter S is a bound 
on the probability that the decoding fails to run to completion after a certain 
number K' of packets have been received. The parameter c is a constant of 
order 1, if our aim is to prove Luby's main theorem about IT codes; in practice 
however it can be viewed as a free parameter, with a value somewhat smaller 
than 1 giving good results. We define a positive function 



r(d) = 



S_l 

K 
S 



for d= 1,2,... (if/5) -1 
fln(5/<5) for d= K/S 
n for d > K/S 



(50.4) 







(see figure 50.2 and exercise 50.4 (p. 594)) then add the ideal soliton distribu- 
tion p to r and normalize to obtain the robust soliton distribution, p: 



p{d) = 



P {d) + T(d) 



(50.5) 



where Z = ^2 d p(d) + r(d). The number of encoded packets required at the 
receiving end to ensure that the decoding can run to completion, with proba- 
bility at least 1 - 5, is K ' = KZ. 
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Figure 50.2. The distributions 
p(d) and r(d) for the case 
K = 10 000, c = 0.2, 5 = 0.05, 
which gives S = 244, K/S = 41, 
and Z ~ 1.3. The distribution r is 
largest at d = 1 and d = K/S. 
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Figure 50.3. The number of 
degree-one checks S (upper figure) 
and the quantity K' (lower figure) 
as a function of the two 
parameters c and 5, for 
K = 10 000. Luby's main theorem 
proves that there exists a value of 
c such that, given K' received 
packets, the decoding algorithm 
will recover the K source packets 
with probability 1 — 5. 
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Luby's (2002) analysis explains how the small-d end of r has the role of 
ensuring that the decoding process gets started, and the spike in r at d = K/S 
is included to ensure that every source packet is likely to be connected to a 
check at least once. Luby's key result is that (for an appropriate value of the 
constant c) receiving K' = K + 2 ln(S/S)S checks ensures that all packets can 
be recovered with probability at least 1 — 5. In the illustrative figures I have 
set the allowable decoder failure probability S quite large, because the actual 
failure probability is much smaller than is suggested by Luby's conservative 
analysis. 

In practice, LT codes can be tuned so that a file of original size K ~ 10 000 
packets is recovered with an overhead of about 5%. Figure 50.4 shows his- 
tograms of the actual number of packets required for a couple of settings of 
the parameters, achieving mean overheads smaller than 5% and 10% respec- 
tively. 
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► 50.4 Applications 

Digital fountain codes are an excellent solution in a wide variety of situations. 
Let's mention two. 

Storage 

You wish to make a backup of a large file, but you are aware that your magnetic 
tapes and hard drives are all unreliable in the sense that catastrophic failures, 
in which some stored packets are permanently lost within one device, occur at 
a rate of something like 10~ 3 per day. How should you store your file? 

A digital fountain can be used to spray encoded packets all over the place, 
on every storage device available. Then to recover the backup file, whose size 
was K packets, one simply needs to find K' ~ K packets from anywhere. 
Corrupted packets do not matter; we simply skip over them and find more 
packets elsewhere. 

This method of storage also has advantages in terms of speed of file re- 
covery. In a hard drive, it is standard practice to store a file in successive 
sectors of a hard drive, to allow rapid reading of the file; but if, as occasion- 
ally happens, a packet is lost (owing to the reading head being off track for 
a moment, giving a burst of errors that cannot be corrected by the packet's 
error-correcting code), a whole revolution of the drive must be performed to 
bring back the packet to the head for a second read. The time taken for one 
revolution produces an undesirable delay in the file system. 

If files were instead stored using the digital fountain principle, with the 
digital drops stored in one or more consecutive sectors on the drive, then one 
would never need to endure the delay of re-reading a packet; packet loss would 
become less important, and the hard drive could consequently be operated 
faster, with higher noise level, and with fewer resources devoted to noisy- 
channel coding. 

> Exercise 50.3. ^ Compare the digital fountain method of robust storage on 
multiple hard drives with RAID (the redundant array of independent 
disks). 



Figure 50.4. Histograms of the 
actual number of packets N 
required in order to recover a file 
of size K = 10 000 packets. The 
parameters were as follows: 
top histogram: c = 0.01, 5 = 0.5 
(S = 10, K/S = 1010, and 
Z ~ 1.01); 

middle: c = 0.03, 5 = 0.5 (S = 30, 
AYS = 337, and Z~ 1.03); 
bottom: c = 0.1, S = 0.5 (S = 99, 
K/S = 101, and Z ~ 1.1). 



Broadcast 



Imagine that ten thousand subscribers in an area wish to receive a digital 
movie from a broadcaster. The broadcaster can send the movie in packets 
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over a broadcast network - for example, by a wide-bandwidth phone line, or 
by satellite. 

Imagine that not all packets are received at all the houses. Let's say 
/ = 0.1% of them are lost at each house. In a standard approach in which the 
file is transmitted as a plain sequence of packets with no encoding, each house 
would have to notify the broadcaster of the f'K missing packets, and request 
that they be retransmitted. And with ten thousand subscribers all requesting 
such retransmissions, there would be a retransmission request for almost every 
packet. Thus the broadcaster would have to repeat the entire broadcast twice 
in order to ensure that most subscribers have received the whole movie, and 
most users would have to wait roughly twice as long as the ideal time before 
the download was complete. 

If the broadcaster uses a digital fountain to encode the movie, each sub- 
scriber can recover the movie from any K' ~ K packets. So the broadcast 
needs to last for only, say, 1.1K packets, and every house is very likely to have 
successfully recovered the whole file. 

Another application is broadcasting data to cars. Imagine that we want to 
send updates to in-car navigation databases by satellite. There are hundreds 
of thousands of vehicles, and they can only receive data when they are out 
on the open road; there are no feedback channels. A standard method for 
sending the data is to put it in a carousel, broadcasting the packets in a fixed 
periodic sequence. 'Yes, a car may go through a tunnel, and miss out on a 
few hundred packets, but it will be able to collect those missed packets an 
hour later when the carousel has gone through a full revolution (we hope) ; or 
maybe the following day. . .' 

If instead the satellite uses a digital fountain, each car needs to receive 
only an amount of data equal to the original file size (plus 5%). 



Further reading 

The encoders and decoders sold by Digital Fountain have even higher efficiency 
than the LT codes described here, and they work well for all blocklengths, not 
only large lengths such as K > 10 000. Shokrollahi (2003) presents Raptor 
codes, which are an extension of LT codes with linear-time encoding and de- 
coding. 



► 50.5 Further exercises 

> Exercise 50.4. ^ Understanding the robust soliton distribution. 

Repeat the analysis of exercise 50.2 (p. 592) but now aim to have the 
expected number of packets of degree 1 be h t (l) = 1 + S for all t, instead 
of 1. Show that the initial required number of packets is 

The reason for truncating the second term beyond d = K/S and replac- 
ing it by the spike at d = K/S (see equation (50.4)) is to ensure that 
the decoding complexity does not grow larger than 0(K In K). 

Estimate the expected number of packets X^d^o(^) an d t ne expected 
number of edges in the sparse graph Y^dho{d)d (which determines the 
decoding complexity) if the histogram of packets is as given in (50.6). 
Compare with the expected numbers of packets and edges when the 
robust soliton distribution (50.4) is used. 
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Exercise 50.5.^-' Show that the spike at d = K/S (equation (50.4)) is an ade- 
quate replacement for the tail of high- weight packets in (50.6). 

Exercise 50.6. Investigate experimentally how necessary the spike at d = 
K/S (equation (50.4)) is for successful decoding. Investigate also whether 
the tail of p{d) beyond d = K/S is necessary. What happens if all high- 
weight degrees are removed, both the spike at d = K/S and the tail of 
p{d) beyond d = K/S? 

Exercise 50.7. ^ Fill in the details in the proof of Luby's main theorem, that 
receiving K' = K + 2 ln(S/5)S checks ensures that all the source packets 
can be recovered with probability at least 1 — 5. 

Exercise 50.8. Optimize the degree distribution of a digital fountain code 
for a file of K = 10 000 packets. Pick a sensible objective function for 
your optimization, such as minimizing the mean of N, the number of 
packets required for complete decoding, or the 95th percentile of the 
histogram of N (figure 50.4). 

i> Exercise 50.9. ^ Make a model of the situation where a data stream is broad- 
cast to cars, and quantify the advantage that the digital fountain has 
over the carousel method. 

Exercise 50.10. ^ Construct a simple example to illustrate the fact that the 
digital fountain decoder of section 50.2 is suboptimal - it sometimes 
gives up even though the information available is sufficient to decode 
the whole file. How does the cost of the optimal decoder compare? 

> Exercise 50.11. ^ If every transmitted packet were created by adding together 

source packets at random with probability V2 of each source packet's 
being included, show that the probability that K' = K received packets 
suffice for the optimal decoder to be able to recover the K source packets 
is just a little below 1/2. [To put it another way, what is the probability 
that a random K x K matrix has full rank?] 

Show that if K' = K + A packets are received, the probability that they 
will not suffice for the optimal decoder is roughly 2~ A . 

> Exercise 50.12. Implement an optimal digital fountain decoder that uses 

the method of Richardson and Urbanke (2001b) derived for fast encod- 
ing of sparse-graph codes (section 47.7) to handle the matrix inversion 
required for optimal decoding. Now that you have changed the decoder, 
you can reoptimize the degree distribution, using higher-weight packets. 
By how much can you reduce the overhead? Confirm the assertion that 
this approach makes digital fountain codes viable as erasure-correcting 
codes for all blocklengths, not just the large blocklengths for which LT 
codes are excellent. 

> Exercise 50.13. ^ Digital fountain codes are excellent rateless codes for erasure 

channels. Make a rateless code for a channel that has both erasures and 
noise. 
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► 50.6 Summary of sparse-graph codes 

A simple method for designing error-correcting codes for noisy channels, first 
pioneered by Gallager (1962), has recently been rediscovered and generalized, 
and communication theory has been transformed. The practical performance 
of Gallager's low-density parity-check codes and their modern cousins is vastly 
better than the performance of the codes with which textbooks have been filled 
in the intervening years. 

Which sparse-graph code is 'best' for a noisy channel depends on the cho- 
sen rate and blocklength, the permitted encoding and decoding complexity, 
and the question of whether occasional undetected errors are acceptable. Low- 
density parity-check codes are the most versatile; it's easy to make a competi- 
tive low-density parity-check code with almost any rate and blocklength, and 
low-density parity-check codes virtually never make undetected errors. 

For the special case of the erasure channel, the sparse-graph codes that are 
best are digital fountain codes. 

► 50.7 Conclusion 

The best solution to the communication problem is: 



Combine a simple, pseudo-random code 
with a message-passing decoder. 
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Appendices 
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A 



Notation 



What does P(A | B, C) mean? P(A \ B,C) is pronounced 'the probability 
that A is true given that B is true and C is true'. Or, more briefly, 'the 
probability of A given B and C". 

What do log and In mean? In this book, logx means the base-two loga- 
rithm, log 2 :r; Inx means the natural logarithm, log e x. 

What does s mean? Usually, a 'hat' over a variable denotes a guess or es- 
timator. So s is a guess at the value of s. 

Integrals. There is no difference between J f(u) du and J du f(u). The inte- 
grand is f(u) in both cases. 

N 

What does ]^[ mean? This is like the summation J2n=i ^ ut ^ denotes a 

n=l 

product. It's pronounced 'product over n from 1 to N'. So, for example, 



N 



Qrj = 1 x 2 x 3 x • • • x N = N\ = exp 



n=l 



N 



E lni 

,n=i 



(A.l) 



I like to choose the name of the free variable in a sum or a product - 
here, n - to be the lower case version of the range of the sum. So n 
usually runs from 1 to N, and m usually runs from 1 to M. This is a 
habit I learnt from Yaser Abu-Mostafa, and I think it makes formulae 
easier to understand. 

/ N\ 

What does mean? This is pronounced 'N choose n\ and it is the 

w 

number of ways of selecting an unordered set of n objects from a set of 
size N . 

'N\ _ N\ 
n)~ (N - n)\ n\ ' ( ' 

This function is known as the combination function. 

What is T(x)? The gamma function is defined by T(x) = du u x ~ l e~ u , 
for x > 0. The gamma function is an extension of the factorial function 
to real number arguments. In general, T(x+ 1) = xT(x), and for integer 
arguments, T(x + 1) = x\. The digamma function is defined by W(x) = 

For large x (for practical purposes, 0.1 < x < oo), 

logTfV) ~ (x - |) log(:r) -x+ ilog27r+ 0(l/x); (A.3) 



598 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 



A — Notation 



and for small x (for practical purposes, < x < 0.5): 

log T(x) - log ^ - lex + 0(x 2 ) (A A) 

where j e is Euler's constant. 

What does H^(l - R/C) mean? Just as sin (s) denotes the inverse func- 
tion to s = sin(:r), so i7 2 _1 (/i) is the inverse function to h = H2(x). 

There is potential confusion when people use sin 2 a; to denote (sin a;) 2 , 
since then we might expect sin -1 s to denote l/sin(s); I therefore like 
to avoid using the notation sin 2 x. 

What does f'(x) mean? The answer depends on the context. Often, a 
'prime' is used to denote differentiation: 

f(x) = ±f(x); (A.5) 
similarly, a dot denotes differentiation with respect to time, t: 

iE ?' (A - 6) 
However, the prime is also a useful indicator for 'another variable', for 
example 'a new value for a variable'. So, for example, x' might denote 
'the new value of x\ Also, if there are two integers that both range from 
1 to N, I will often name those integers n and n' . 

So my rule is: if a prime occurs in an expression that could be a func- 
tion, such as f'(x) or h'(y), then it denotes differentiation; otherwise it 
indicates 'another variable'. 

What is the error function? Definitions of this function vary. I define it 
to be the cumulative probability of a standard (variance = 1) normal 
distribution, 

$(z) = [ exp(-z 2 /2)/\/2^ dz. (A.7) 

J — oo 

What does £{r) mean? £[r] is pronounced 'the expected value of r' or 'the 
expectation of r\ and it is the mean value of r. Another symbol for 
'expected value' is the pair of angle-brackets, (r). 

What does \x\ mean? The vertical bars '| • |' have two meanings. If A is a 
set, then |^4| denotes the number of elements in the set; if a; is a number, 
then | a; | is the absolute value of x. 

What does [A|P] mean? Here, A and P are matrices with the same num- 
ber of rows. [A|P] denotes the double-width matrix obtained by putting 
A alongside P. The vertical bar is used to avoid confusion with the 
product AP. 

What does x T mean? The superscript t is pronounced 'transpose'. Trans- 
posing a row-vector turns it into a column vector: 



(1,2,3) T = 2 , (A.8) 




and vice versa. [Normally my vectors, indicated by bold face type (x), 
are column vectors.] 

Similarly, matrices can be transposed. If My is the entry in row i and 
column j of matrix M, and N = M T , then Nji = Mij. 
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What are Trace M and det M? The trace of a matrix is the sum of its di- 
agonal elements, 

Trace M = ^M ti . (A.9) 

i 

The determinant of M is denoted det M. 
What does 5 mn mean? The 5 matrix is the identity matrix. 



fl if 
\ if 



m = n 
m jtz n. 



Another name for the identity matrix is I or 1. Sometimes I include a 
subscript on this symbol - Ik - which indicates the size of the matrix 
(K x K). 

What does 5(x) mean? The delta function has the property 

dx f(x)S(x) = f(0). (A.10) 



Another possible meaning for S(S) is the truth function, which is 1 if the 
proposition S is true but I have adopted another notation for that. After 
all, the symbol 8 is quite busy already, with the two roles mentioned above 
in addition to its role as a small real number S and an increment operator 
(as in Sx)\ 

What does 1[«S] mean? 1[S] is the truth function, which is 1 if the propo- 
sition S is true and otherwise. For example, the number of positive 
numbers in the set T = {—2, 1, 3} can be written 

£>[*><)]. (A.ll) 

What is the difference between ':=' and '=' ? In an algorithm, x := y 
means that the variable x is updated by assigning it the value of y. 

In contrast, x = y is a proposition, a statement that x is equal to y. 



See Chapters 23 and 29 for further definitions and notation relating to 
probability distributions. 
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B 



Some Physics 



► B.l About phase transitions 

A system with states x in contact with a heat bath at temperature T = 1//3 
has probability distribution 

p(x|/3)= zM exp(_/3i?(x)) - (R1) 

The partition function is 

Z(/3) = ^exp(-/3£(x)). (B.2) 

X 

The inverse temperature (3 can be interpreted as defining an exchange rate 
between entropy and energy. is the amount of energy that must be 

given to a heat bath to increase its entropy by one nat. 

Often, the system will be affected by some other parameters such as the 
volume of the box it is in, V , in which case Z is a function of V too, Z(f3, V). 

For any system with a finite number of states, the function Z{(3) is evi- 
dently a continuous function of /3, since it is simply a sum of exponentials. 
Moreover, all the derivatives of Z(f3) with respect to [3 are continuous too. 

What phase transitions are all about, however, is this: phase transitions 
correspond to values of (3 and V (called critical points) at which the derivatives 
of Z have discontinuities or divergences. 

Immediately we can deduce: 



Only systems with an infinite number of states can show phase 
transitions. 



Often, we include a parameter TV describing the size of the system. Phase 
transitions may appear in the limit TV — > oo. Real systems may have a value 
of TV like 10 23 . 

If we make the system large by simply grouping together TV independent 
systems whose partition function is Z^((3), then nothing interesting happens. 
The partition function for TV independent identical systems is simply 

Z(N)(P) = [Z m (P)} N - (B.3) 

Now, while this function Z^{[3) may be a very rapidly varying function of (3, 
that doesn't mean it is showing phase transitions. The natural way to look at 
the partition function is in the logarithm 

In Z {N) ((3)=N In Z {1) (f3). (B.4) 
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Duplicating the original system N times simply scales up all properties like 
the energy and heat capacity of the system by a factor of N. So if the original 
system showed no phase transitions then the scaled up system won't have any 
either. 



Only systems with long-range correlations show phase transitions. 



Long-range correlations do not require long-range energetic couplings; for 
example, a magnet has only short-range couplings (between adjacent spins) 
but these are sufficient to create long-range order. 



Why are points at which derivatives diverge interesting? 

The derivatives of In Z describe properties like the heat capacity of the sys- 
tem (that's the second derivative) or its fluctuations in energy. If the second 
derivative of InZ diverges at a temperature 1/(3, then the heat capacity of the 
system diverges there, which means it can absorb or release energy without 
changing temperature (think of ice melting in ice water); when the system is 
at equilibrium at that temperature, its energy fluctuates a lot, in contrast to 
the normal law-of-large-numbers behaviour, where the energy only varies by 
one part in y/~N. 



A toy system that shows a phase transition 

Imagine a collection of N coupled spins that have the following energy as a 
function of their state x G {0, 1} N . 

£(x) = (-f x = (0,0,0,... ,0) 
[ otherwise. 

This energy function describes a ground state in which all the spins are aligned 
in the zero direction; the energy per spin in this state is — e. if any spin 
changes state then the energy is zero. This model is like an extreme version 
of a magnetic interaction, which encourages pairs of spins to be aligned. 

We can contrast it with an ordinary system of N independent spins whose 
energy is: 

£°(x) = e^(2x„-l). (B.6) 

n 

Like the first system, the system of independent spins has a single ground 
state (0, 0, 0, . . . , 0) with energy —Ne, and it has roughly 2^ states with energy 
very close to 0, so the low-temperature and high-temperature properties of the 
independent-spin system and the coupled-spin system are virtually identical. 
The partition function of the coupled-spin system is given by 

Z(p) = e 0Ne + 2 N - 1. (B.7) 

The function 

In Z{(3) = In (e? Ne + 2 N - l) (B.8) 
is sketched in figure B.la along with its low temperature behaviour, 

InZ(P) ~ Nfie, /3^oo, (B.9) 
and its high temperature behaviour, 



InZ(fJ) ~ 7Vln2, /?->(). 



(B.10) 
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var(E) N=24 
var(E) N=8 



(a) 



beta 



(c) 



beta 




(b) 



beta 



Figure B.l. (a) Partition function 
of toy system which shows a phase 
transition for large N. The arrow 
marks the point (3 C = log2/e. (b) 
The same, for larger N. 
(c) The variance of the energy of 
the system as a function of /? for 
two system sizes. As N increases 
the variance has an increasingly 
sharp peak at the critical point f3 c . 
Contrast with figure B.2. 




(a) 



beta 



(b) 



beta 



Figure B.2. The partition function 
(a) and energy- variance (b) of a 
system consisting of N 
independent spins. The partition 
function changes gradually from 
one asymptote to the other, 
regardless of how large N is; the 
variance of the energy docs not 
have a peak. The fluctuations are 
largest at high temperature (small 
(3) and scale linearly with system 
size N . 



The arrow marks the point 



0=™ 
e 



(B.ll) 



at which these two asymptotes intersect. In the limit N — > oo, the graph of 
InZ(P) becomes more and more sharply bent at this point (figure B.lb). 

The second derivative of In Z, which describes the variance of the energy 
of the system, has a peak value, at (3 = In 2 /e, roughly equal to 



N 2 e 2 



(B.12) 



which corresponds to the system spending half of its time in the ground state 
and half its time in the other states. 

At this critical point, the heat capacity of this system is thus proportional 
to A?" 2 ; the heat capacity per spin is proportional to N, which, for infinite N, is 
infinite, in contrast to the behaviour of systems away from phase transitions, 
whose capacity per atom is a finite number. 

For comparison, figure B.2 shows the partition function and energy- variance 
of the ordinary independent-spin system. 



More generally 

Phase transitions can be categorized into 'first-order' and 'continuous' transi- 
tions. In a first-order phase transition, there is a discontinuous change of one 



Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted, http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 

604 B — Some Physics 



or more order-parameters; in a continuous transition, all order-parameters 
change continuously. [What's an order-parameter? - a scalar function of the 
state of the system; or, to be precise, the expectation of such a function.] 

In the vicinity of a critical point, the concept of 'typicality' defined in 
Chapter 4 does not hold. For example, our toy system, at its critical point, 
has a 50% chance of being in a state with energy —Ne, and roughly a 1/2 N+1 
chance of being in each of the other states that have energy zero. It is thus not 
the case that In 1/P(x) is very likely to be close to the entropy of the system 
at this point, unlike a system with N i.i.d. components. 

Remember that information content (In 1 /P(x)) and energy are very closely 
related. If typicality holds, then the system's energy has negligible fluctua- 
tions, and vice versa. 
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► C.l Finite field theory 

Most linear codes are expressed in the language of Galois theory 

Why are Galois fields an appropriate language for linear codes? First, a defi- 
nition and some examples. 

A field F is a set F = {0, F'} such that 

1. F forms an Abelian group under an addition operation '+', with 
being the identity; [Abelian means all elements commute, i.e., 
satisfy a + b = b + a.] 

2. F' forms an Abelian group under a multiplication operation 
multiplication of any element by yields 0; 

3. these operations satisfy the distributive rule (a+b) ■ c= a-c+b-c. 

For example, the real numbers form a field, with '+' and '•' denoting 
ordinary addition and multiplication. 

A Galois field GF(q) is a field with a finite number of elements q. 

A unique Galois field exists for any q = p m , where p is a prime number 
and m is a positive integer; there are no other finite fields. 

GF(2). The addition and multiplication tables for GF{2) are shown in ta- 
ble C.l. These are the rules of addition and multiplication modulo 2. 

GF{p). For any prime number p, the addition and multiplication rules are 
those for ordinary addition and multiplication, modulo p. 

GF(A). The rules for GF(p m ), with m > 1, are not those of ordinary addition 
and multiplication. For example the tables for GF(4) (table C.2) are not 
the rules of addition and multiplication modulo 4. Notice that 1 + 1 = 0, 
for example. So how can GF(A) be described? It turns out that the 
elements can be related to polynomials. Consider polynomial functions 
of x of degree 1 and with coefficients that are elements of GF(2). The 
polynomials shown in table C.3 obey the addition and multiplication 
rules of GF(4) if addition and multiplication are modulo the polynomial 
x 2 + x + 1, and the coefficients of the polynomials are from GF(2). For 
example, B ■ B = x 2 + (1 + l)x + 1 = x = A. Each element may also be 
represented as a bit pattern as shown in table C.3, with addition being 
bitwise modulo 2, and multiplication defined with an appropriate carry 
operation. 



+ 





1 







1 








1 











1 


1 





1 





1 



Table C.l. Addition and 
multiplication tables for GF(2). 



+ 





1 


A 


B 








1 


A 


B 


1 


1 





B 


A 


A 


A 


B 





1 


B 


B 


A 


1 










1 


A 


B 

















1 





1 


A 


B 


A 





A 


B 


1 


B 





B 


1 


A 



Table C.2. Addition and 
multiplication tables for GF(A). 



Element 


Polynomial Bit pattern 





00 


1 


1 01 


A 


x 10 


B 


x + 1 11 



Table C.3. Representations of the 
elements of GF(4). 




Some Mathematics 
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GF(8). We can denote the elements of GF(8) by {0, 1, A, B, C, D, E, F}. Each 
element can be mapped onto a polynomial over GF{2). The multiplica- 
tion and addition operations are given by multiplication and addition of 
the polynomials, modulo x 3, + x + 1. The multiplication table is given 
below. 



element 


polynomial 


binary representation 







1 


A 


B 


C 


D 


E 


F 








000 





























1 


1 


001 


1 





1 


A 


B 


C 


D 


E 


F 


A 


X 


010 


A 





A 


C 


E 


B 


1 


F 


D 


B 


x + 1 


011 


B 





B 


E 


D 


F 


C 


1 


A 


C 


x 1 


100 


C 





C 


B 


F 


E 


A 


D 


1 


D 


x 2 + 1 


101 


D 





D 


1 


C 


A 


F 


B 


E 


E 


x 2 + X 


110 


E 





E 


F 


1 


D 


B 


A 


C 


F 


x 2 + x+l 


111 


F 





F 


D 


A 


1 


E 


C 


B 



Why are Galois fields relevant to linear codes? Imagine generalizing a binary 
generator matrix G and binary vector s to a matrix and vector with elements 
from a larger set, and generalizing the addition and multiplication operations 
that define the product Gs. In order to produce an appropriate input for 
a symmetric channel, it would be convenient if, for random s, the product 
Gs produced all elements in the enlarged set with equal probability. This 
uniform distribution is easiest to guarantee if these elements form a group 
under both addition and multiplication, because then these operations do not 
break the symmetry among the elements. When two random elements of a 
multiplicative group are multiplied together, all elements are produced with 
equal probability. This is not true of other sets such as the integers, for which 
the multiplication operation is more likely to give rise to some elements (the 
composite numbers) than others. Galois fields, by their definition, avoid such 
symmetry-breaking effects. 



► C.2 Eigenvectors and eigenvalues 

A right-eigenvector of a square matrix A is a non-zero vector eR that satisfies 

Ae R = Ae R , (C.l) 

where A is the eigenvalue associated with that eigenvector. The eigenvalue 
may be a real number or complex number and it may be zero. Eigenvectors 
may be real or complex. 

A left-eigenvector of a matrix A is a vector eL that satisfies 

e[A = Ae[. (C.2) 

The following statements for right-eigenvectors also apply to left-eigenvectors 

• If a matrix has two or more linearly independent right-eigenvectors with 
the same eigenvalue then that eigenvalue is called a degenerate eigenvalue 
of the matrix, or a repeated eigenvalue. Any linear combination of those 
eigenvectors is another right-eigenvector with the same eigenvalue. 

• The principal right-eigenvector of a matrix is, by definition, the right- 
eigenvector with the largest associated eigenvalue. 

• If a real matrix has a right-eigenvector with complex eigenvalue A = 
x + yi then it also has a right-eigenvector with the conjugate eigenvalue 
A* = x — yi. 
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C.2: Eigenvectors and eigenvalues 
Symmetric matrices 

If A is a real symmetric N x N matrix then 

1. all the eigenvalues and eigenvectors of A are real; 

2. every left-eigenvector of A is also a right-eigenvector of A with the same 
eigenvalue, and vice versa; 

3. a set of TV eigenvectors and eigenvalues {e( a ', X a }a=i can be found that 
are orthonormal, that is, 

e^-e^ = S ab ; (C.3) 

the matrix can be expressed as a weighted sum of outer products of the 
eigenvectors: 

N 

A = ^A a [e( a )][eW] T . (C.4) 

a=l 

(Whereas I often use i and n as indices for sets of size / and N, I will use the 
indices a and b to run over eigenvectors, even if there are N of them. This is 
to avoid confusion with the components of the eigenvectors, which are indexed 
by n, e.g. e^ .) 

General square matrices 

An N x N matrix can have up to N distinct eigenvalues. Generically, there 
are TV eigenvalues, all distinct, and each has one left-eigenvector and one right- 
eigenvector. In cases where two or more eigenvalues coincide, for each distinct 
eigenvalue that is non-zero there is at least one left-eigenvector and one right- 
eigenvector. 

Left- and right-eigenvectors that have different eigenvalue are orthogonal, 
that is, 

if Xajt Afcthen e^-ej^ = 0. (C.5) 

Non-negative matrices 

Definition. If all the elements of a non-zero matrix C satisfy C mn > then C 
is a non-negative matrix. Similarly, if all the elements of a non-zero vector c 
satisfy c n > then c is a non-negative vector. 

Properties. A non-negative matrix has a principal eigenvector that is non- 
negative. It may also have other eigenvectors with the same eigenvalue that 
are not non-negative. But if the principal eigenvalue of a non-negative matrix 
is not degenerate, then the matrix has only one principal eigenvector e^, and 
it is non-negative. 

Generically, all the other eigenvalues are smaller in absolute magnitude. 
[There can be several eigenvalues of identical magnitude in special cases.] 

Transition probability matrices 

An important example of a non-negative matrix is a transition probability 
matrix Q. 

Definition. A transition probability matrix Q has columns that are probability 
vectors, that is, it satisfies Q > and 



Qij = 1 for all j. 



(C.6) 
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Matrix 



Eigenvalues and eigenvectors ei_,eR 





"12 0" 
1 1 
1 


2.41 




1 






0.41 






.58 




.82 














-.58 




-.82 








.82 




.58 














.82 




.58 


















1 




1 















1 

1 1 



1.62 



-0.62 



.53" 




.53" 




.85" 




.85" 


.85 




.85 




-.53 




-.53 



110 
1 
10 
11 



1.62 



60 




.60 




.l-.5i 


37 




.37 




-.3-.4-Z 


37 




.37 




.3+.4i 


60 




.60 




-.l + .5i 



0.5+0.9i 

.l-.5i' 
.3+.4i 

-.3-Ai 
-.l+.5i 



0.5 
.l+.5i~ 

-.3+Ai 
.3-Ai 

-.l-.5i 



0.9i 
.l + .5i 
.3-Ai 
-.3+.4i 
-.l-.5i 



-0.62 



37 




.37 


60 




-.60 


60 




-.60 


37 




.37 



Table C.4. Some matrices and their eigenvectors. 



Matrix Eigenvalues and eigenvectors ei_,eR 



'.71" 




.36" 




-.93" 




-.71" 


.71 




.93 




.36 




.71 



.35 

.46 

1 .65 .54 





1 




-0.2- 


0.3i 


-0.2+0.3i 


".58" 




".14" 




"-.8+.i-r 




.2-.5f 




~-.8-.li~ 




.2 + .5i 


.58 




.41 




-.2-.5i 




-.6+.2i 




-.2+.5i 




-.6-.2i 


.58 




.90 




.2 + .2i 




A+.3i 




.2-.2i 




A-.3i 



Table C.5. Transition probability matrices for generating random paths through trellises. 

This property can be rewritten in terms of the all-ones vector n = (1,1,..., 1) T : 

n T Q = n T . (C.7) 

So n is the principal left-eigenvector of Q with eigenvalue Ai = 1. 

e[_ 1} = n. (C.8) 

Because it is a non-negative matrix, Q has a principal right-eigenvector that 
is non-negative, e^ 1 '. Generically, for Markov processes that are ergodic, this 
eigenvector is the only right-eigenvector with eigenvalue of magnitude 1 (see 
table C.6 for illustrative exceptions). This vector, if we normalize it such that 
e^'-n = 1, is called the invariant distribution of the transition probability 
matrix. It is the probability density that is left unchanged under Q. Unlike 
the principal left-eigenvector, which we explicitly identified above, we can't 
usually identify the principal right-eigenvector without computation. 

The matrix may have up to N — 1 other right-eigenvectors all of which are 
orthogonal to the left-eigenvector n, that is, they are zero-sum vectors. 



► C.3 Perturbation theory 

Perturbation theory is not used in this book, but it is useful in this book's 
fields. In this section we derive first order perturbation theory for the eigen- 
vectors and eigenvalues of square, not necessarily symmetric, matrices. Most 
presentations of perturbation theory focus on symmetric matrices, but non- 
symmetric matrices (such as transition matrices) also deserve to be perturbed! 
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C.3: Perturbation theory 



Matrix 



Eigenvalues and eigenvectors ei_,eR 



(a) 



.90 .20 

.10 .80 

.90 .20 

.10 .80 



0.70 



0.70 













.71 




.89 




.45 




.71 






















.71 




.45 




-.89 




-.71 












.71 




.89 
























-.45 




-.71 


.71 




.45 
























.89 




.71 


1 




0.98 




0.70 




0.69 



(a') 



.90 .20 

.10 .79 .02 

.01 .88 .20 

.10 .80 



.50 




.87 




-.18 




-.66 




.20 




.63 




-.19 




-.61 


.50 




.43 




-.15 




-.28 




-.40 




-.63 




.41 




.65 


.50 




.22 




.66 




.61 




-.40 




-.32 




-.44 




-.35 


.50 




.11 




.72 




.33 




.80 




.32 




.77 




.30 




1 




0.70 




0.70 




-1 


.50 




.63 




-.32 




.50 




.32 




-.50 




.50 




.63 


.50 




.32 




.63 




-.50 




-.63 




.50 




.50 




.32 


.50 




.63 




-.32 




.50 




-.32 




.50 




-.50 




-.63 


.50 




.32 




.63 




-.50 




.63 




-.50 




-.50 




-.32 



(b) 



.90 .20 

.10 .80 

.90 .20 

.10 .80 



Table C.6. Illustrative transition probability matrices and their eigenvectors showing the two ways of 
being non-ergodic. (a) More than one principal eigenvector with eigenvalue 1 because the 
state space falls into two unconnected pieces, (a') A small perturbation breaks the degen- 
eracy of the principal eigenvectors, (b) Under this chain, the density may oscillate between 
two parts of the state space. In addition to the invariant distribution, there is another 
right-eigenvector with eigenvalue —1. In general such circulating densities correspond to 
complex eigenvalues with magnitude 1. 



We assume that we have an N x N matrix H that is a function H(e) of 
a real parameter e, with e = being our starting point. We assume that a 
Taylor expansion of H(e) is appropriate: 

H(e) = H(0) + eV + ■ ■ ■ (C.9) 



where 

V^. (CIO) 

We assume that for all e of interest, H(e) has a complete set of N right- 
eigenvectors and left-eigenvectors, and that these eigenvectors and their eigen- 
values are continuous functions of e. This last assumption is not necessarily a 
good one: if H(0) has degenerate eigenvalues then it is possible for the eigen- 
vectors to be discontinuous in e; in such cases, degenerate perturbation theory 
is needed. That's a fun topic, but let's stick with the non-degenerate case 
here. 

We write the eigenvectors and eigenvalues as follows: 

H(e)e^(e)=\^(e)e^(e), (C.ll) 

and we Taylor- expand 

A (a)( e ) = A (a)( ) + e/i (a) + ... ( C12 ) 



with 



and 



» 



0A(°>(e) 



de 



4°>(6)=eW(0) + ef£ 



,(a) 



(C.13) 
(C.14) 
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with 

4 a) - (C15) 

and similar definitions for e[°^ and We define these left- vectors to be row 
vectors, so that the 'transpose' operation is not needed and can be banished. 

We are free to constrain the magnitudes of the eigenvectors in whatever way 
we please. Each left-eigenvector and each right-eigenvector has an arbitrary 
magnitude. The natural constraints to use are as follows. First, we constrain 
the inner products with: 

e[ a) (e)e[ ? o) (e) = 1, for all a. (C.16) 
Expanding the eigenvectors in e, equation (C.19) implies 

(4 Q) (0) + df > + • • -)(4 a) (0) + 4 a) + ■■■) = !, (C17) 
from which we can extract the terms in e, which say: 

e[ a) (0)4 a) + f< Q) 4 a) (0) = (C.18) 
We are now free to choose the two constraints: 

e<°>(0)f< o) = 0, f^ a) 4 a) (0) = 0, (C.19) 

which in the special case of a symmetric matrix correspond to constraining 
the eigenvectors to be of constant length, as defined by the Euclidean norm. 

OK, now that we have defined our cast of characters, what do the defining 
equations (C.ll) and (C.9) tell us about our Taylor expansions (C.13) and 
(C.15)? We expand equation (C.ll) in e. 

(H(0)+eV+- • .)(4 a) (0)+ £ 4 a) + - • •) = (A( a >(0)+e M (a) + - ■ ■)(eR ) (0)+ef£ +- ' 

(C.20) 

Identifying the terms of order e, we have: 

H(0)4 a) + Ve^ a) (0) = A< a >(0)4 a) + M (a) 4 a) (0). ( C - 21 ) 

We can extract interesting results from this equation by hitting it with e[^(0): 

4 6) (0)H(0)f£ l) +4 b) (0)v4 Q) (0) = 4 6) (0)AM(0)f£ l) +^4 6) (0)4 Q) (0). 

A( b )e[_ b) (0)4 a) + e^(0)Ve^(0) = X (a \o)e^ (0)f { R a) + ^S ab . (C.22) 
Setting b = a we obtain 

4 a) (0)v4 a) (0) = M (a) - (C23) 
Alternatively, choosing b ^ a, we obtain: 

4 b) (0)v4 a) (0) = [a<°)(0) - A^(0)] 4 b) (0)4 a) (C24) 

=> e " )( ° )f ^ = AH(0)-A( b )(0) e " )(0) < )(0) - (C - 25) 

Now, assuming that the right-eigenvectors {e^(0)}^. 1 form a complete basis, 
we must be able to write 

f( a) =5> b 4 6) (0), (C.26) 

b 
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C.3: Perturbation theory 



where 



«, 6 = e< 6) (0)f< o) , (C.27) 



so, comparing (C.25) and (C.27), we have: 

f (a) y 4 h) (0)v4 a) (0) (6) 

fR -^AW(0)-AW(0) eR (0) ' (C - 28) 

Equations (C.23) and (C.28) are the solution to the first order perturbation 
theory problem, giving respectively the first derivative of the eigenvalue and 
the eigenvectors. 

Second-order perturbation theory 

If we expand the eigenvector equation (C.ll) to second order in e, and assume 
that the equation 

H(e) = H(0) + eV (C.29) 
is exact, that is, H is a purely linear function of e, then we have: 

(H(0) + £ V)(e^(0) + e f^ + i e 2 g ( a) + ...) 

= (AW (0) + e M (a) + \t 2 v (a) + ■ ■ -)(4 a) (0) + ef < o) + + • • •) (C.30) 

where gj^ and are the second derivatives of the eigenvector and eigenvalue. 
Equating the second-order terms in e in equation (C.30), 

VfW + iH(0)g< a) = + 1^(0) + ^WfW. (C.31) 

Hitting this equation on the left with e[^(0), we obtain: 
e[_ a) (0)v4 a) + iA( a )e[_ a) (0)g^ a) 
= £A<°>(0)e[ o) (0)gjj t o) + ^e[ a) (0)e^(0) + ^e^O)^. (C.32) 
The term (0)f^ is equal to zero because of our constraints (C.19), so 

4 a) (0)Vf<j a) = X -v {a \ (C.33) 
so the second derivative of the eigenvalue with respect to e is given by 

-i» = e< o) (0)vV e L ) (°) Ve R ) (0) e W (Q) (C34) 
2 v e L W v Z. A (a) (0 )_AW(0) R [ ) ( ' 

v [e^(0)Ve< a )(0)][e^(0)Ve^(0)] 
" £ AW(0)-AW(0) • (C - 35) 

This is as far as we will take the perturbation expansion. 
Summary 

If we introduce the abbreviation Vb a for e[^(0)Vep^(0), we can write the 
eigenvectors of H(e) = H(0) + eV to first order as 

4%) - e?(0) + 'E AM ( 0) %, (0) 4>) + (C30) 

b^a v 1 K 1 

and the eigenvalues to second order as 

AW(0) - AW(0) 



AH (e) = AH (0) + + £ , (n ^* + (C.37) 
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612 C — Some Mathematics 



► C.4 Some numbers 





2 8192 


^q2466 


Number of distinct 1-kilobyte files 




2 1024 


1Q 308 


Number of states of a 2D Ising model with 32 x 32 spins 


2 iooo 




i n301 
1U 


Number of binary strings of length 1000 


2 500 




3xl0 150 






2 469 


10 141 


Number of binary strings of length 1000 having 100 Is and 900 0s 




2 266 


1Q 80 


Number of electrons in universe 


2 200 




1.6 xlO 60 






2 190 


10 57 


Number of electrons in solar system 




2 m 


3xl0 51 


Number of electrons in the earth 


2 ioo 




1Q 30 






2 98 


3x 10 29 


Age of universe /picoseconds 




n58 
z 


Qwi nl7 
o X 1U 


Age of universe / seconds 


o50 
z 




1U 




2 40 




10 12 








10 11 


Number of neurons in human brain 






10 11 


lMmnlipr of Viil^ ^torpH on a TV\/T) 






3x 10 10 


Number of bits in the wheat genome 






6xl0 9 


Number of bits in the human genome 




2 32 


6xl0 9 


Population of earth 


2 30 




10 9 








2.5 x 10 8 


Number of fibres in the corpus callosum 






2x 10 8 


Number of bits in C Elcgans (a worm) genome 






2x 10 8 


Nnmhpr of bit<^ in Arnhirlnn^i^ th nil nun la fl nwpri n c* - nlant rplatpH to brnrrrili 1 cpnomp 

±y L1111UC1 Ul UllO LLL JTL 1 Lh U b Lh VJ IJ J L<D LI LLVLtLV 1 OLV \ Ch 11U VVClillt VJLCXiLLVi IdtlUCLl UU UlUkj^Ull 1 tCllUlllC 




2 25 


3xl0 7 


One year/seconds 






2xl0 7 


Number of bits in the compressed PostScript file that is this book 






2xl0 7 


Number of bits in unix kernel 






in 7 

1U 


Number of bits in the E. Coli genome, or in a floppy disk 






A * , i r»6 

4 x 10 


Number of years since human/chimpanzee divergence 


2 20 




10 6 


1 048 576 






9 v 1 f|5 


iy umuci oi geiiei anono since uuiudii/ cnimjjanzee ciiveigence 






^ v i r»4 


Number of genes in human genome 






3 x 10 4 


Number of genes in Arabidopsis thaliaud genome 






1.5 xlO 3 


Number of base pairs in a gene 


2 io 


e 7 


10 3 


2 10 = 1024; e 7 = 1096 


2° 




10° 


1 




2- 2 


2.5X10" 1 


Lifetime probability of dying from smoking one pack of cigarettes per day. 






1(T 2 


Lifetime probability of dying in a motor vehicle accident 


2 -io 




10~ 3 








10~ 5 


Lifetime probability of developing cancer because of drinking 2 litres per day of 








water containing 12p.p.b. benzene 


2-20 




io- 6 








3 xlO" 8 


Probability of error in transmission of coding DNA, per nucleotide, per generation 


2-30 




io- 9 




2-60 




10 -i8 


Probability of undetected error in a hard disk drive, after error correction 
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heat capacity, 401, 404 
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marginalization, 29, 295, 319 
Markov chain, 141, 168 
Markov chain Monte Carlo, see Monte 

Carlo methods 
Markov model, 111, see Markov chain 

and hidden Markov model 
marriage, 454 
matrix, 409 
matrix identities, 438 
max-product, 339 
maxent, 308, see maximum entropy 
maximum distance separable, 219 
maximum entropy, 308, 551 
maximum likelihood, 6, 300, 347 
maximum likelihood decoder, 152 
maximum a posteriori decoder, 325 
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noise. 3, see channel 
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noisy channel, see channel 
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noisy-channel coding theorem, 15, 
152, 162, 171, 229 

Gaussian channel, 181 

linear codes, 229 
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noninformative, 319 
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Newton algorithm, 441 
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perfect simulation, 413 
periodic variable, 315 
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phase transition, 361, 403, 601 
philosophy, 26, 119, 384 
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phone number, 58, 129 
photon counter, 307, 342, 448 
physics, 85 
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pigeon-hole principle, 86 
pitchfork bifurcation, 291 
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practical, 183, see error-correcting 

code, practical 
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prediction, 29, 52 
predictive distribution, 111 
prefix code, 92, 95 
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improper, 353 

subjectivity, 30 
prior equivalence, 447 
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prize, on game show, 57 
probabilistic model, 111, 120 
probabilistic movie, 551 
probability, 26, 38 
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contrasted with likelihood, 28 

density, 30, 33 
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distribution 
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sum-product algorithm 
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prospecting, 451 
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regulatory, 201, 204 
protein synthesis, 280 
protocol, 589 
pseudoinverse, 550 
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R3, see repetition code 
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random variable, 26, 463 
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random-coding exponent, 171 
random-walk Metropolis method, 388 
rant, see sermon 

confidence level, 465 
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rate, 152 
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regret, 455 
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regularization, 529, 550 
regularization constant, 479 
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rejection, 364, 366, 533 
rejection sampling, 364, see Monte 
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429, 435, 475 
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runlength-limited channel, 249 
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MAP method, 283 

maximum entropy, 308 

maximum likelihood, 306 

maximum a posteriori method, 
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p-value, 463 
sampling theory, 64 
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stopping rule, 463 
turbo codes, 581 
unbiased estimator, 307 
worst-case-ism, 207 
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Shevelev, Vladimir, 572 
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Shokrollahi, M. Amin, 568 
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Siegel, Paul, 262 
sigmoid, 473, 527 
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simplex, 173, 316 
Simpson's paradox, 355 
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simulated annealing, 379, 392, see 
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methods 
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BUGS, 371 
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Gaussian processes, 534 

hash function, 200 
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solar system, 346 
soldier, 241 

soliton distribution, 592 
sound, 187 
source code, 73, 75 

algorithms, 119, 121 

block code, 76 

block-sorting compression, 121 
Burrows- Wheeler transform, 121 
for complex sources, 353 
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for integers, 132 
Huffman, see Huffman code 
implicit probabilities, 102 
optimal lengths, 97, 102 
prefix code, 95 
software, see software 
stream codes, 110-130 
supermarket, 96, 104, 112 
symbol code, 91 

optimal, 91 
uniquely decodeable, 94 
variable symbol durations, 125, 
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source coding theorem, 78, 91, 229, 
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span, 331 
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Stirling's approximation, 1, 8 
stochastic, 472 

stochastic dynamics, see Hamiltonian 
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stochastic gradient, 476 
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stopping rule, 463 
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syndrome decoding, 11, 216, 229, 371 
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typicality, 78, 80, 162 

behaviour of evidence, 54, 60 
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